Event monitoring and management

ABSTRACT

Described are techniques used in monitoring the performance, security and health of a system used in an industrial application. Agents included in the industrial network report data to an appliance or server. The appliance stores the data and determines when an alarm condition has occurred. Notifications are sent upon detecting an alarm condition. The alarm thresholds may be user defined. A threat thermostat controller determines a threat level used to control the connectivity of a network used in the industrial application.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is based on U.S. Provisional Patent Application No. 60/477,088, filed on Jun. 9, 2003, Attorney Docket No. VRS-00160, which is incorporated by reference herein.

BACKGROUND

1. Technical Field

This application generally relates to a network, and more particularly to event monitoring and management therein.

2. Description of Related Art

Computer systems may be used in performing a variety of different tasks. For example, an industrial network of computer systems and components may be used in controlling and/or monitoring industrial systems. Such industrial systems can be used in connection with manufacturing, power generation, energy distribution, waste handling, transportation, telecommunications, water treatment, and the like. The industrial network may be connected and accessible via other networks, both directly and indirectly, including a corporate network and the Internet. The industrial network may thus be susceptible to both internal and external cyber-attacks. As a preventive measure against external cyber-attacks, firewalls or other security measures may be used to separate the industrial network from other networks. However, the industrial network is still vulnerable since such security measures are not foolproof in the prevention of external attacks by viruses, worms, Trojans and other forms of malicious code as well as computer hacking, intrusions, insider attacks, errors, and omissions that may occur. Additionally, an infected laptop, for example, can bypass the firewall by connecting to the industrial network using a modem, direct connection, or by a virtual private network (VPN). The laptop may then introduce worms or other forms of malicious code into the industrial network. It should be noted that an industrial network may be susceptible to other types of security threats besides those related to the computer systems and network.

Thus, it may be desirable to monitor events of the industrial network and accordingly raise alerts. It may be desirable that such monitoring and reporting be performed efficiently, minimizing the resources of the industrial network consumed. It may further be desirable to have the industrial network perform a threat assessment and respond in accordance with the threat assessment. In performing the assessment, it may also be desirable to take into account a wide variety of conditions relating to performance, health and security information about the industrial network, such as may be obtained using the monitoring data, as well as other factors reflecting conditions external to the industrial network.

SUMMARY OF THE INVENTION

In accordance with one aspect of the invention is a method for controlling connectivity in a network comprising: receiving one or more inputs; determining a threat level indicator in accordance with said one or more inputs; and selecting, for use in said network, a firewall configuration in accordance with said threat level indicator. The firewall configuration may be selected from a plurality of firewall configurations each associated with a different threat level indicator. A first firewall configuration associated with a first threat level indicator may provide for more restrictive connectivity of said network than a second firewall configuration associated with a second threat level indicator when said first threat level indicator is a higher threat level than said second threat level indicator. A firewall configuration associated with a highest threat level indicator may provide for disconnecting said network from all other less-trusted networks. The disconnecting may include physically disconnecting said network from other networks. The network may be reconnected to said less trusted networks when a current threat level is a level other than said highest threat level indicator. The method may also include automatically loading said firewall configuration as a current firewall configuration in use in said network. The one or more inputs may include at least one of: a manual input, a metric about a system in said network, a metric about said network, a derived value determined using a plurality of weighted metrics including one metric about said network, a derived value determined using a plurality of metrics, and an external source from said network. If the manual input is specified, the manual input may determine the threat level indicator overriding all other indicators.
The plurality of weighted metrics may include a metric about at least one of: a network intrusion detection, a network intrusion prevention, a number of failed login attempts, a number of users with a high level of privileges. The high level of privileges may correspond to one of: administrator privileges and root user privileges. The selecting additionally may select one or more of the following: an antivirus configuration, an intrusion prevention configuration, and an intrusion detection configuration.
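
The selection of a firewall configuration from a threat level indicator may be illustrated with a minimal sketch. It is not the implementation described in this application: the four levels, the configuration file names, the metric names, the weights, and the assumption that each metric has been normalized to the range 0 to 1 are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional

# Hypothetical mapping of threat levels to saved firewall configurations.
# Higher levels are assumed to be more restrictive; the highest level
# stands for disconnecting from all less-trusted networks.
FIREWALL_CONFIGS: Dict[int, str] = {
    1: "fw-level1-open.rules",
    2: "fw-level2-restricted.rules",
    3: "fw-level3-critical-only.rules",
    4: "fw-level4-disconnect.rules",
}
HIGHEST_LEVEL = max(FIREWALL_CONFIGS)

# Assumed weights for the kinds of metrics named in the text.
WEIGHTS: Dict[str, float] = {
    "nids_alerts": 0.4,
    "nips_blocks": 0.3,
    "failed_logins": 0.2,
    "privileged_users": 0.1,
}


def derived_score(metrics: Dict[str, float]) -> float:
    """Weighted sum of metrics, each clamped to the assumed range 0..1."""
    return sum(
        weight * min(max(metrics.get(name, 0.0), 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )


def threat_level(metrics: Dict[str, float],
                 manual_level: Optional[int] = None) -> int:
    """Return a threat level; a manual input overrides all other inputs."""
    if manual_level is not None:
        if manual_level not in FIREWALL_CONFIGS:
            raise ValueError("unknown manual threat level")
        return manual_level
    # Map a score in 0..1 onto levels 1..HIGHEST_LEVEL.
    level = 1 + int(derived_score(metrics) * HIGHEST_LEVEL)
    return min(level, HIGHEST_LEVEL)


def select_firewall_config(level: int) -> str:
    """Pick the saved configuration associated with a threat level."""
    return FIREWALL_CONFIGS[level]
```

In this sketch an operator could force a level with `threat_level(metrics, manual_level=4)`, which corresponds to the manual input overriding all other indicators.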

In accordance with another aspect of the invention is a computer program product for controlling connectivity in a network comprising code that: receives one or more inputs; determines a threat level indicator in accordance with said one or more inputs; and selects, for use in said network, a firewall configuration in accordance with said threat level indicator. The firewall configuration may be selected from a plurality of firewall configurations each associated with a different threat level indicator. A first firewall configuration associated with a first threat level indicator may provide for more restrictive connectivity of said network than a second firewall configuration associated with a second threat level indicator when said first threat level indicator is a higher threat level than said second threat level indicator. A firewall configuration associated with a highest threat level indicator may provide for disconnecting said network from all other less-trusted networks. The code that disconnects may include physically disconnecting said network from other networks. The network may be reconnected to said less trusted networks when a current threat level is a level other than said highest threat level indicator. The computer program product may also include code that automatically loads said firewall configuration as a current firewall configuration in use in said network. The one or more inputs may include at least one of: a manual input, a metric about a system in said network, a metric about said network, a derived value determined using a plurality of weighted metrics including one metric about said network, a derived value determined using a plurality of metrics, and an external source from said network. If the manual input is specified, the manual input may determine the threat level indicator overriding all other indicators.
The plurality of weighted metrics may include a metric about at least one of: a network intrusion detection, a network intrusion prevention, a number of failed login attempts, a number of users with a high level of privileges. The high level of privileges may correspond to one of: administrator privileges and root user privileges. The code that selects may additionally select one or more of the following: an antivirus configuration, an intrusion prevention configuration, and an intrusion detection configuration.

In accordance with another aspect of the invention is a method of event reporting by an agent comprising: receiving data; determining if said data indicates a first occurrence of an event of interest associated with a metric since a previous periodic reporting; reporting said first occurrence of an event if said determining determines said data indicates said first occurrence; and reporting a summary including said metric in a periodic report at a first point in time. The reporting of said first occurrence and said reporting of said summary may be performed without a request for a report. Data for said reporting of said first occurrence and said reporting of said summary may be performed by said agent communicating data at an application level to a reporting destination using a one-way communication connection. The reporting of said first occurrence and said summary may also include: opening a communication connection; sending data to said reporting destination; and closing said communication connection, said agent only sending data to said reporting destination without reading any communication from said communication connection. The communication connection may be a TCP or UDP socket. The periodic report may include a summary of a selected set of one or more data sources and associated values for a time interval since a last periodic report was sent to a reporting destination. The selected set of one or more metrics may be a first level of reporting information and said periodic report may include a second level of reporting information used to perform at least one of the following: determine a cause of a problem, and take a corrective action to a problem.
The reporting of said first occurrence and said summary may include transmitting messages from said agent to a reporting destination, each of said messages being a fixed maximum size. A time interval at which said periodic report is sent by said agent and data included in each of said messages may be determined in accordance with at least one of: resources available on a computer system and a network in which said agent is included. The agent may execute on a first computer system and report data to another computer system. The method may also include: monitoring a log file; and extracting said second level of reporting information from said log file, wherein said log file includes log information about a computer system upon which said agent is executing. The agent may transmit an XML communication to said reporting destination using said communication connection. A threshold may be specified for an amount of data that said agent can report in a fixed reporting interval, said threshold being equal to or greater than a fixed maximum size for each summary report sent by said agent. A report sent for any of said reporting may include an encrypted checksum preventing modifications of said report while said report is being communicated from an agent to a receiver in a network. The reporting may be performed by an agent that sends a report, said report including one of: a timestamp which increases with time duration, and a sequence number which increases with time duration, used by a receiver of said report. The receiver may use said one of said timestamp or said sequence number in authenticating a report received by said receiver as being sent by said agent, said receiver processing received reports having said one of a timestamp or sequence number which is greater than another one of a timestamp or sequence number associated with a last report received from said agent.
The second level of reporting information may identify at least one source associated with an attack, wherein said source is one of: a user, a machine, and an application, said percentage indicating a percentage of events associated with said at least one source for a type of attack.
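
The combination of immediate first-occurrence reporting and periodic summaries described in this aspect may be sketched as follows. This is an illustration under stated assumptions, not the agent of this application: the class name `ReportingAgent`, the dictionary message layout, and the `send` callback standing in for the communication connection are all hypothetical.

```python
from collections import Counter
from typing import Callable


class ReportingAgent:
    """Reports the first occurrence of each metric's event in a reporting
    period immediately, and everything else only in the periodic summary."""

    def __init__(self, send: Callable[[dict], None]) -> None:
        self._send = send          # hypothetical transport callback
        self._counts: Counter = Counter()

    def observe(self, metric: str) -> None:
        """Record one event of interest associated with a metric."""
        # A Counter returns 0 for a missing key without inserting it.
        is_first = self._counts[metric] == 0
        self._counts[metric] += 1
        if is_first:
            # First occurrence since the previous periodic report.
            self._send({"type": "first_occurrence", "metric": metric})

    def periodic_report(self) -> None:
        """Send a summary of the period, then start a new period."""
        self._send({"type": "summary", "metrics": dict(self._counts)})
        self._counts.clear()
```

With this arrangement a burst of identical events produces one immediate message plus one counted entry in the next summary, which is one way to keep the agent's network usage bounded.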

In accordance with another aspect of the invention is a method of event reporting by an agent comprising: receiving data; determining if said data corresponds to an event of interest associated with at least one security metric; and sending a report to a reporting destination, said report including said at least one security metric for a fixed time interval, wherein said report is sent from said agent communicating data at an application level to said reporting destination using a one-way communication connection. The agent may only send data on said one-way communication connection to said reporting destination without reading any communication from said communication connection. The report may include at least one performance metric in accordance with said data received.
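
A one-way report of this kind may be sketched with a UDP socket that is opened, written to once, and closed without ever being read. The function name, the default maximum message size of 1024 bytes, and the example XML payload in the usage note are assumptions made for illustration and are not taken from this application.

```python
import socket


def send_report_one_way(payload: bytes, host: str, port: int,
                        max_size: int = 1024) -> int:
    """Open a UDP socket, send one report, and close the socket.

    The agent never calls recv() on the socket, so at the application
    level the connection is used only in the send direction.
    """
    if len(payload) > max_size:
        # Assumed policy: refuse reports above the fixed maximum size.
        raise ValueError("report exceeds fixed maximum message size")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        return sock.sendto(payload, (host, port))
    finally:
        sock.close()
```

A call such as `send_report_one_way(b"<report metric='failed_logins' count='3'/>", "127.0.0.1", 5140)` would send a single datagram; the address and port here are placeholders.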

In accordance with another aspect of the invention is a method of event reporting by an agent comprising: receiving data; determining if said data indicates a security event of interest; and reporting a summary including information on a plurality of occurrences of said security event of interest occurring within a fixed time interval, said summary being sent at a predetermined time interval. The reporting of the summary may be performed without a request for a report. The reporting of said summary may be performed by said agent communicating data at an application level to a reporting destination using a one-way communication connection. The reporting of said summary may further include: opening a communication connection; sending data to said reporting destination; and closing said communication connection, said agent only sending data to said reporting destination without reading any communication from said communication connection. The communication connection may be a TCP or UDP socket. The agent may transmit an XML communication to said reporting destination using said communication connection. The reporting of said summary may include transmitting periodic messages from said agent to a reporting destination, each of said messages having a fixed maximum size.

In accordance with another aspect of the invention is a computer program product for event reporting by an agent comprising code that: receives data; determines if said data indicates a first occurrence of an event of interest associated with a metric since a previous periodic reporting; reports said first occurrence of an event if said code that determines determines that said data indicates said first occurrence; and reports a summary including said metric in a periodic report at a first point in time. The code that reports said first occurrence and said code that reports said summary may be performed without a request for a report. Data for the code that reports said first occurrence and said code that reports said summary may be performed by said agent communicating data at an application level to a reporting destination using a one-way communication connection. At least one of said code that reports said first occurrence and said code that reports said summary may further comprise code that: opens a communication connection; sends data to said reporting destination; and closes said communication connection, said agent only sending data to said reporting destination without reading any communication from said communication connection. The communication connection may be a TCP or UDP socket. The periodic report may include a summary of a selected set of one or more data sources and associated values for a time interval since a last periodic report was sent to a reporting destination. The selected set of one or more metrics may be a first level of reporting information and said periodic report may include a second level of reporting information used to perform at least one of the following: determine a cause of a problem, and take a corrective action to a problem. The code that reports said first occurrence and said code that reports said summary may include code that transmits messages from said agent to a reporting destination, each of said messages being a fixed maximum size.
A time interval at which said periodic report is sent by said agent and data included in each of said messages may be determined in accordance with at least one of: resources available on a computer system and a network in which said agent is included. The agent may execute on a first computer system and report data to another computer system. The computer program product may further comprise code that: monitors a log file; and extracts said second level of reporting information from said log file, wherein said log file includes log information about a computer system upon which said agent is executing. The agent may transmit an XML communication to said reporting destination using said communication connection. A threshold may be specified for an amount of data that said agent can report in a fixed reporting interval, said threshold being equal to or greater than a fixed maximum size for each summary report sent by said agent. A report sent for any of said code that reports may use an encrypted checksum preventing modifications of said report while said report is being communicated from an agent to a receiver in a network. The code that reports may be performed by an agent that sends a report, said report including one of: a timestamp which increases with time duration, and a sequence number which increases with time duration, used by a receiver of said report. The receiver may use said one of said timestamp or said sequence number in authenticating a report received by said receiver as being sent by said agent, said receiver processing received reports having said one of a timestamp or sequence number which is greater than another one of a timestamp or sequence number associated with a last report received from said agent.
The second level of reporting information may identify at least one source associated with an attack, wherein said source is one of: a user, a machine, and an application, said percentage indicating a percentage of events associated with said at least one source for a type of attack.
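
The receiver-side checks described in this aspect (a checksum that detects modification in transit, and a sequence number that must exceed that of the last accepted report) may be sketched as follows. The sketch substitutes an HMAC-SHA256 tag computed with a shared key for the "encrypted checksum" of the text; that substitution, the JSON wire layout, and the names `seal_report` and `Receiver` are assumptions made for illustration.

```python
import hashlib
import hmac
import json
from typing import Dict, Optional


def seal_report(key: bytes, agent_id: str, seq: int, body: dict) -> bytes:
    """Serialize a report and prefix it with a keyed checksum (HMAC)."""
    message = json.dumps(
        {"agent": agent_id, "seq": seq, "body": body}, sort_keys=True
    ).encode()
    tag = hmac.new(key, message, hashlib.sha256).hexdigest().encode()
    return tag + b"\n" + message


class Receiver:
    """Accepts a report only if its checksum verifies and its sequence
    number is greater than that of the last report from the same agent."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._last_seq: Dict[str, int] = {}

    def accept(self, wire: bytes) -> Optional[dict]:
        tag, _, message = wire.partition(b"\n")
        expected = hmac.new(self._key, message, hashlib.sha256).hexdigest().encode()
        if not hmac.compare_digest(tag, expected):
            return None  # modified in transit or sealed with another key
        report = json.loads(message)
        agent, seq = report["agent"], report["seq"]
        if seq <= self._last_seq.get(agent, -1):
            return None  # replayed or out-of-date report
        self._last_seq[agent] = seq
        return report["body"]
```

Because the agent only sends and never reads, the receiver cannot issue a challenge; a strictly increasing sequence number or timestamp is one way to reject replayed reports under that constraint.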

In accordance with another aspect of the invention is a computer program product for event reporting by an agent comprising code that: receives data; determines if said data corresponds to an event of interest associated with at least one security metric; and sends a report to a reporting destination, said report including said at least one security metric for a fixed time interval, wherein said report is sent from said agent communicating data at an application level to said reporting destination using a one-way communication connection. The agent may only send data on said one-way communication connection to said reporting destination without reading any communication from said communication connection. The report may include at least one performance metric in accordance with said data received.

In accordance with another aspect of the invention is a computer program product for event reporting by an agent comprising code that: receives data; determines if said data indicates a security event of interest; and reports a summary including information on a plurality of occurrences of said security event of interest occurring within a fixed time interval, said summary being sent at a predetermined time interval. The code that reports said summary may be performed without a request for a report. Data for said code that reports said summary may be performed by said agent communicating data at an application level to a reporting destination using a one-way communication connection. The code that reports said summary may further comprise code that: opens a communication connection; sends data to said reporting destination; and closes said communication connection, said agent only sending data to said reporting destination without reading any communication from said communication connection. The communication connection may be a TCP or UDP socket. The agent may transmit an XML communication to said reporting destination using said communication connection. The code that reports said summary may include code that transmits periodic messages from said agent to a reporting destination, each of said messages having a fixed maximum size.

In accordance with another aspect of the invention is a method of event notification comprising: receiving a first report of a condition; sending a first notification message about said first report of said condition; sending a second notification message about said condition at a first notification interval; receiving subsequent reports at fixed time intervals; and sending a subsequent notification message at a second notification interval if said condition is still ongoing during said second notification interval, wherein said second notification interval has a length which is a multiple of said first notification interval. The first report may be sent from a reporting agent on a first computer system reporting about one of: said first computer system and a network including said first computer system, and said notification messages are sent from a notification server on a second computer system. Notification messages may be sent to a notification point at successive notification intervals wherein each of said successive notification intervals increases approximately exponentially with respect to an immediately prior notification interval. The condition may be associated with an alarm condition and an alarm condition may be set when a current level of a metric is not in accordance with a predetermined threshold value. Each of said notification messages may include a first level of information about said condition and a second level of information used to perform at least one of the following: determine a cause of said condition, and take a corrective action for said condition. An option may be included in a reporting agent to enable and disable reporting of said second level of information to a notification server from said agent sending said first report. An option may be used to enable and disable condition notification messages including said second level of information.
An alarm condition may be associated with a first level alarm and an alarm state of said first level may be maintained when a current level of a metric is in accordance with said predetermined threshold value until an acknowledgement of said alarm state at said first level is received by said notification server. The alarm condition may transition to a second level alarm when said current level is not in accordance with said predetermined threshold and another threshold associated with a second level, and said second level alarm is maintained when a current level of a metric is in accordance with one of: said predetermined threshold and said other threshold until acknowledgement of said second level alarm is received by said notification server. Reports may be sent from a reporting agent executing on a computer system in an industrial network to an appliance included in said industrial network and each of said reports includes events occurring within said industrial network. An alarm condition may be determined in accordance with a plurality of weighted metrics, said plurality of weighted metrics including at least one metric about: a network intrusion detection, a network intrusion prevention, a number of failed login attempts, a number of users with a level of privileges greater than a level associated with a user-level account.
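
The notification schedule described in this aspect, in which each successive notification interval is a multiple of the prior one, may be illustrated with a short sketch. The function name, the choice of a constant multiplication factor, and the time units are assumptions; the text only requires that intervals grow approximately exponentially.

```python
from typing import List


def notification_times(first_interval: float, factor: float,
                       duration: float) -> List[float]:
    """Times, measured from the first report, at which notifications are
    sent while a condition stays ongoing for `duration` time units.

    The first notification goes out immediately; each later interval is
    `factor` times the previous one.
    """
    if first_interval <= 0 or factor < 1:
        raise ValueError("first_interval must be > 0 and factor >= 1")
    times = [0.0]
    interval = first_interval
    t = interval
    while t <= duration:
        times.append(t)
        interval *= factor   # next interval is a multiple of the prior one
        t += interval
    return times
```

For example, with a first interval of 5 minutes and a factor of 2, a condition lasting 60 minutes yields notifications at 0, 5, 15 and 35 minutes, so a long-running condition does not flood the notification point.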

In accordance with another aspect of the invention is a method of event notification comprising: receiving a first report of a condition at a reporting destination; and sending a notification message from said reporting destination to a notification destination, said notification message including a summary of information about events occurring in a fixed time interval, said summary identifying at least one of: a source and a target associated with an attack occurring within said fixed time interval, and a percentage of events associated with said at least one of said source and said target. The summary may identify at least one source associated with an attack, wherein said source is one of: a user, a machine, and an application, said percentage indicating a percentage of events associated with said at least one source for a type of attack. The summary may identify at least one target associated with an attack, wherein said target is one of: a user, a machine, an application, and a port, said percentage indicating a percentage of events associated with said at least one target for a type of attack. The summary may identify the portion that a type of attack represents with respect to all attacks in said fixed time interval.
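
The percentages carried in such a summary may be computed as in the following sketch. The event representation (a dictionary with hypothetical `type` and `source` keys) and both function names are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable


def source_percentages(events: Iterable[dict], attack_type: str) -> Dict[str, float]:
    """Percentage of events of one attack type attributed to each source."""
    counts = Counter(e["source"] for e in events if e["type"] == attack_type)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {source: 100.0 * n / total for source, n in counts.items()}


def attack_type_share(events: Iterable[dict]) -> Dict[str, float]:
    """Portion that each attack type represents of all attacks in the interval."""
    counts = Counter(e["type"] for e in events)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: 100.0 * n / total for kind, n in counts.items()}
```

The same counting approach would apply to targets (a user, a machine, an application, or a port) by keying on a target field instead of the source field.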

In accordance with another aspect of the invention is a method of event notification comprising: receiving reports of a potential cyber-attack condition at fixed time intervals; and sending a notification message about said conditions when said conditions exceed a notification threshold. A notification threshold may be determined using an alarm condition in accordance with a plurality of weighted metrics, said plurality of weighted metrics including at least one metric about: a network intrusion detection, a network intrusion prevention, a number of failed login attempts, a number of users with a level of privileges greater than a level associated with a user-level account. The notification message may include a summary of information about events occurring in a fixed time interval, said summary identifying at least one of: a source and a target associated with an attack occurring within said fixed time interval, and a percentage of events associated with said at least one of said source and said target. The summary may identify at least one source associated with an attack, wherein said source is one of: a user, a machine, and an application, said percentage indicating a percentage of events associated with said at least one source for a type of attack. The summary may identify at least one target associated with an attack, wherein said target is one of: a user, a machine, an application, and a port, said percentage indicating a percentage of events associated with said at least one target for a type of attack. The summary may identify the portion that a type of attack represents with respect to all attacks in said fixed time interval.

In accordance with another aspect of the invention is a computer program product for event notification comprising code that: receives a first report of a condition; sends a first notification message about said first report of said condition; sends a second notification message about said condition at a first notification interval; receives subsequent reports at fixed time intervals; and sends a subsequent notification message at a second notification interval if said condition is still ongoing during said second notification interval, wherein said second notification interval has a length which is a multiple of said first notification interval. The first report may be sent from a reporting agent on a first computer system reporting about one of: said first computer system and a network including said first computer system, and said notification messages are sent from a notification server on a second computer system. Notification messages may be sent to a notification point at successive notification intervals wherein each of said successive notification intervals increases approximately exponentially with respect to an immediately prior notification interval. The condition may be associated with an alarm condition and an alarm condition is set when a current level of a metric is not in accordance with a predetermined threshold value. Each of the notification messages may include a first level of information about said condition and a second level of information used to perform at least one of the following: determine a cause of said condition, and take a corrective action for said condition. An option may be included in a reporting agent to enable and disable reporting of said second level of information to a notification server from said agent sending said first report. An option may be used to enable and disable condition notification messages including said second level of information.
An alarm condition may be associated with a first level alarm and an alarm state of said first level is maintained when a current level of a metric is in accordance with said predetermined threshold value until an acknowledgement of said alarm state at said first level is received by said notification server. The alarm condition may transition to a second level alarm when said current level is not in accordance with said predetermined threshold and another threshold associated with a second level, and said second level alarm may be maintained when a current level of a metric is in accordance with one of: said predetermined threshold and said other threshold until acknowledgement of said second level alarm is received by said notification server. Reports may be sent from a reporting agent executing on a computer system in an industrial network to an appliance included in said industrial network and each of said reports includes events occurring within said industrial network. An alarm condition may be determined in accordance with a plurality of weighted metrics, said plurality of weighted metrics including at least one metric about: a network intrusion detection, a network intrusion prevention, a number of failed login attempts, a number of users with a level of privileges greater than a level associated with a user-level account.
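
The alarm behavior described above, in which an alarm state is maintained after the metric returns to an acceptable level until an acknowledgement is received, may be sketched as a small latched state machine. The sketch assumes that "not in accordance with a threshold" means the metric exceeds it, and that an acknowledgement lowers the state to whatever level the most recent reading warrants; the class name and numeric states are hypothetical.

```python
class LatchedAlarm:
    """Two-level alarm that holds its highest reached level until acknowledged."""

    NORMAL, FIRST_LEVEL, SECOND_LEVEL = 0, 1, 2

    def __init__(self, warning: float, critical: float) -> None:
        self.warning = warning      # assumed first-level threshold
        self.critical = critical    # assumed second-level threshold
        self.state = self.NORMAL    # latched alarm state
        self._current = self.NORMAL # level warranted by the latest reading

    def update(self, value: float) -> int:
        """Process a new metric reading and return the latched state."""
        if value > self.critical:
            self._current = self.SECOND_LEVEL
        elif value > self.warning:
            self._current = self.FIRST_LEVEL
        else:
            self._current = self.NORMAL
        # The latched state can rise on its own but never falls here.
        self.state = max(self.state, self._current)
        return self.state

    def acknowledge(self) -> int:
        """An operator acknowledgement releases the latch."""
        self.state = self._current
        return self.state
```

Latching in this way keeps a brief excursion visible to an operator even if the metric has already recovered by the time the notification is read.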

In accordance with another aspect of the invention is a computer program product for event notification comprising code that: receives a first report of a condition at a reporting destination; and sends a notification message from said reporting destination to a notification destination, said notification message including a summary of information about events occurring in a fixed time interval, said summary identifying at least one of: a source and a target associated with an attack occurring within said fixed time interval, and a percentage of events associated with said at least one of said source and said target. The summary may identify at least one source associated with an attack, wherein said source is one of: a user, a machine, and an application, said percentage indicating a percentage of events associated with said at least one source for a type of attack. The summary may identify at least one target associated with an attack, wherein said target is one of: a user, a machine, an application, and a port, said percentage indicating a percentage of events associated with said at least one target for a type of attack. The summary may identify the portion that a type of attack represents with respect to all attacks in said fixed time interval.

In accordance with another aspect of the invention is a computer program product for event notification comprising code that: receives reports of a potential cyber-attack condition at fixed time intervals; and sends a notification message about said conditions when said conditions exceed a notification threshold. A notification threshold may be determined using an alarm condition in accordance with a plurality of weighted metrics, said plurality of weighted metrics including at least one metric about: a network intrusion detection, a network intrusion prevention, a number of failed login attempts, a number of users with a level of privileges greater than a level associated with a user-level account. The notification message may include a summary of information about events occurring in a fixed time interval, said summary identifying at least one of: a source and a target associated with an attack occurring within said fixed time interval, and a percentage of events associated with said at least one of said source and said target. The summary may identify at least one source associated with an attack, wherein said source is one of: a user, a machine, and an application, said percentage indicating a percentage of events associated with said at least one source for a type of attack. The summary may identify at least one target associated with an attack, wherein said target is one of: a user, a machine, an application, and a port, said percentage indicating a percentage of events associated with said at least one target for a type of attack. The summary may identify the portion that a type of attack represents with respect to all attacks in said fixed time interval.

In accordance with another aspect of the invention is a method for monitoring an industrial network comprising: reporting first data about a first computer system by a first agent executing on said first computer system in said industrial network, said first computer system performing at least one of: monitoring or controlling a physical process of said industrial network, said first data including information about software used in connection with said physical process. The method may also include reporting second data about communications on a connection between said industrial network and another network by a second agent executing on a second computer system. The second data reported by said second agent may be included in an appliance to which said first data is sent. The first agent may report on at least one of: critical file monitoring, log file for said first computer system, hardware and operating system of said first computer system, password and login, a specific application executing on said computer system wherein said application is in accordance with a particular industrial application of said industrial network. A plurality of agents may execute on said first computer system monitoring said first computer system. The plurality of agents may include a master agent and other agents performing a predetermined set of monitoring tasks, said master agent controlling execution of said other agents. The plurality of agents may report data at predetermined intervals to one of: an appliance and said second computer system.
The method may also include performing, by at least one of said plurality of agents: obtaining data from a data source; parsing said data; performing pattern matching on said parsed data to determine events of interest; recording any events of interest; reporting any events of interest in accordance with occurrences of selected events in a time interval; creating a message including said summary at predetermined time intervals; and encrypting at least one of: said message and a checksum of said message. The first data may include at least one of the following metrics: a number of open listen connections and a number of abnormal process terminations. When a number of open listen connections falls below a first level, an event corresponding to a component failure may be determined. When a number of open listen connections is above a second level, an event corresponding to a new component or unauthorized component may be determined. The second agent may report on network activity in accordance with a set of rules, said rules including at least one rule indicating that events in a business network are flagged as suspicious in said industrial network. The events may include at least one of: an event associated with a web browser, and an event associated with e-mail. The second agent may report on an address binding of a physical device identifier to a network address if the physical device identifier of a component was not previously known, or said network address in the address binding is a reassignment of said network address within a predetermined time period since said network address was last included in an address binding. The second agent may report second data about a firewall, and said second data may include at least one of: a change to a saved firewall configuration corresponding to a predetermined threat level, a change to a current set of firewall configuration rules currently controlling operations between said industrial network and said other network.
Log files associated with said firewall may be stored remotely at a location on said second computer system with log files for said second computer system activity. The second data may include at least one threat assessment from a source external to said industrial network. The second data may include at least one of: a threat level indicator from a corporate network connected to said industrial network, a threat level indicator from a public network source, and a threat level indicator that is manually input. The method may also include: receiving at least said first data by a receiver; authenticating said first data as being sent by said first agent; and processing, in response to said authenticating, said first data by said receiver. The authenticating may include at least one of: verifying use of said first agent's encryption key, checking validity of a message checksum, and using a timestamp or sequence number to detect invalid reports received by said receiver as being sent from said first agent. The reporting may be performed in accordance with a threshold size indicating an amount of data that said first agent is permitted to transmit in a fixed periodic reporting interval.
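One possible way a receiver might authenticate reports is sketched below in Python. As an assumption made only for this example, a keyed checksum (HMAC-SHA-256 from the Python standard library) stands in for both the verification of the agent's key and the message checksum, and a strictly increasing sequence number is used to reject replayed or stale reports. The names `sign_report` and `ReportReceiver` are hypothetical.

```python
import hashlib
import hmac


def sign_report(key, seq, payload):
    """Compute a keyed checksum over the sequence number and the payload."""
    message = seq.to_bytes(8, "big") + payload
    return hmac.new(key, message, hashlib.sha256).digest()


class ReportReceiver:
    """Accepts a report only if it is authentic and newer than the last one."""

    def __init__(self, agent_keys):
        self._keys = dict(agent_keys)   # agent id -> shared secret key
        self._last_seq = {}             # agent id -> last accepted sequence number

    def accept(self, agent_id, seq, payload, tag):
        key = self._keys.get(agent_id)
        if key is None:
            return False                # unknown agent
        expected = sign_report(key, seq, payload)
        if not hmac.compare_digest(expected, tag):
            return False                # wrong key or altered message
        if seq <= self._last_seq.get(agent_id, -1):
            return False                # replayed or out-of-date report
        self._last_seq[agent_id] = seq
        return True


receiver = ReportReceiver({"agent-1": b"shared-secret"})
tag = sign_report(b"shared-secret", 1, b"cpu=12")
first = receiver.accept("agent-1", 1, b"cpu=12", tag)
replayed = receiver.accept("agent-1", 1, b"cpu=12", tag)
tampered = receiver.accept("agent-1", 2, b"cpu=99", tag)
```

Only reports that pass both checks would be handed on for processing; the sequence number is recorded only after the checksum has been verified, so a forged report cannot advance it.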

In accordance with another aspect of the invention is a computer program product for monitoring an industrial network comprising code that: reports first data about a first computer system by a first agent executing on said first computer system in said industrial network, said first computer system performing at least one of: monitoring or controlling a physical process of said industrial network, said first data including information about software used in connection with said physical process. The computer program product may also include code that reports second data about communications on a connection between said industrial network and another network by a second agent executing on a second computer system. The second data reported by said second agent may be included in an appliance to which said first data is sent. The first agent may report on at least one of: critical file monitoring, log file for said first computer system, hardware and operating system of said first computer system, password and login, a specific application executing on said computer system wherein said application is in accordance with a particular industrial application of said industrial network. A plurality of agents may execute on said first computer system monitoring said first computer system. The plurality of agents may include a master agent and other agents performing a predetermined set of monitoring tasks, said master agent controlling execution of said other agents. The plurality of agents may report data at predetermined intervals to one of: an appliance and said second computer system.
The computer program product may also include code for performing, by at least one of said plurality of agents: obtaining data from a data source; parsing said data; performing pattern matching on said parsed data to determine events of interest; recording any events of interest; reporting any events of interest in accordance with occurrences of selected events in a time interval; creating a message including said summary at predetermined time intervals; and encrypting at least one of: said message and a checksum of said message. The first data may include at least one of the following metrics: a number of open listen connections and a number of abnormal process terminations. When a number of open listen connections falls below a first level, an event corresponding to a component failure may be determined. When a number of open listen connections is above a second level, an event corresponding to a new component or unauthorized component may be determined. The second agent may report on network activity in accordance with a set of rules, said rules including at least one rule indicating that events in a business network are flagged as suspicious in said industrial network. The events may include at least one of: an event associated with a web browser, and an event associated with e-mail. The second agent may report on an address binding of a physical device identifier to a network address if the physical device identifier of a component was not previously known, or said network address in the address binding is a reassignment of said network address within a predetermined time period since said network address was last included in an address binding. The second agent may report second data about a firewall, and said second data includes at least one of: a change to a saved firewall configuration corresponding to a predetermined threat level, a change to a current set of firewall configuration rules currently controlling operations between said industrial network and said other network.
Log files associated with said firewall may be stored remotely at a location on said second computer system with log files for said second computer system activity. The second data may include at least one threat assessment from a source external to said industrial network. The second data may include at least one of: a threat level indicator from a corporate network connected to said industrial network, a threat level indicator from a public network source, and a threat level indicator that is manually input. The computer program product may also include code that: receives at least said first data by a receiver; authenticates said first data as being sent by said first agent; and processes, in response to said code that authenticates, said first data by said receiver. The code that authenticates may include at least one of: code that verifies use of said first agent's encryption key and checks validity of a message checksum, and code that uses a timestamp or sequence number to detect invalid reports received by said receiver as being sent from said first agent. The code that reports may use a threshold size indicating an amount of data that said first agent is permitted to transmit in a fixed periodic reporting interval.
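The address binding conditions described above may be easier to follow with a small example. The following Python sketch reports a binding when the physical device identifier has not been seen before, or when a network address is bound to a different device within a predetermined period since that address last appeared in a binding. The class name `BindingMonitor`, the reason strings, and the use of seconds as the time unit are assumptions made for this example, as is the interpretation of a reassignment as a binding to a different physical device identifier.

```python
class BindingMonitor:
    """Tracks bindings of physical device identifiers to network addresses."""

    def __init__(self, reassign_window):
        self.reassign_window = reassign_window  # predetermined period, in seconds
        self.known_devices = set()              # device identifiers seen so far
        self.last_binding = {}                  # address -> (device id, time seen)

    def observe(self, device_id, address, now):
        """Record a binding and return the reasons, if any, to report it."""
        reasons = []
        if device_id not in self.known_devices:
            reasons.append("new device")
        previous = self.last_binding.get(address)
        if previous is not None:
            prev_device, prev_time = previous
            recent = now - prev_time <= self.reassign_window
            if prev_device != device_id and recent:
                reasons.append("rapid reassignment")
        self.known_devices.add(device_id)
        self.last_binding[address] = (device_id, now)
        return reasons


monitor = BindingMonitor(reassign_window=300)
r1 = monitor.observe("00:11:22:33:44:55", "10.0.0.5", now=0)
r2 = monitor.observe("00:11:22:33:44:55", "10.0.0.5", now=10)
r3 = monitor.observe("66:77:88:99:aa:bb", "10.0.0.5", now=20)
r4 = monitor.observe("00:11:22:33:44:55", "10.0.0.5", now=1000)
```

In this sketch the fourth observation is not reported, because the device is already known and the address was last bound more than 300 seconds earlier.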

In accordance with another aspect of the invention is a method for detecting undesirable messages in a network comprising: receiving a message in said network; determining if said message is undesirable in accordance with at least one rule defining an acceptable message in said network; and reporting said message as undesirable if said message is not determined to be in accordance with said at least one rule. The method may also include: defining another rule for use in said determining if an additional message type is determined to be acceptable in said network.

In accordance with another aspect of the invention is a computer program product for detecting undesirable messages in a network comprising code that: receives a message in said network; determines if said message is undesirable in accordance with at least one rule defining an acceptable message in said network; and reports said message as undesirable if said message is not determined to be in accordance with said at least one rule. The computer program product may also include code that: defines another rule for use in said determining if an additional message type is determined to be acceptable in said network.
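The approach described above, in which rules define what is acceptable and anything else is reported, can be sketched as follows in Python. The rule fields (`protocol`, `dst_port`), the message representation as a dictionary, and the class names are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Defines one kind of acceptable message (fields are illustrative)."""
    protocol: str
    dst_port: int

    def matches(self, message):
        return (message.get("protocol") == self.protocol
                and message.get("dst_port") == self.dst_port)


class AcceptableMessageFilter:
    """Reports any message that no rule defines as acceptable."""

    def __init__(self, rules):
        self.rules = list(rules)

    def add_rule(self, rule):
        """Define another rule once a new message type is deemed acceptable."""
        self.rules.append(rule)

    def is_undesirable(self, message):
        return not any(rule.matches(message) for rule in self.rules)


message_filter = AcceptableMessageFilter([Rule("tcp", 502)])
control_msg = {"protocol": "tcp", "dst_port": 502}
web_msg = {"protocol": "tcp", "dst_port": 80}

before = message_filter.is_undesirable(web_msg)   # no rule covers it yet
message_filter.add_rule(Rule("tcp", 80))
after = message_filter.is_undesirable(web_msg)    # now defined as acceptable
```

Because the rules describe acceptable traffic rather than known attacks, a message of a previously unseen kind is reported by default in this sketch.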

In accordance with another aspect of the invention is a method for performing periodic filesystem integrity checks comprising: receiving two or more sets of filesystem entries, each set representing a grouping of one or more filesystem entries; selecting zero or more entries from each set; and performing integrity checking for each selected entry from each set during a reporting period. Each of said two or more sets may correspond to a predetermined classification level. If a first classification level is more important than a second classification level, said first classification level may include fewer entries than said second classification level. A number of entries from each set may be determined in accordance with a level of importance associated with said set.

In accordance with another aspect of the invention is a computer program product for performing periodic filesystem integrity checks comprising code that: receives two or more sets of filesystem entries, each set representing a grouping of one or more filesystem entries; selects zero or more entries from each set; and performs integrity checking for each selected entry from each set during a reporting period. Each of said two or more sets may correspond to a predetermined classification level. If a first classification level is more important than a second classification level, said first classification level may include fewer entries than said second classification level. A number of entries from each set may be determined in accordance with a level of importance associated with said set.
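A minimal Python sketch of the periodic selection described above follows. It assumes, purely for illustration, that each classification level has a fixed per-period quota and that entries within a set are visited in rotation, so that a small, important set may be checked in full every period while a larger set is covered over several periods. The function names, the quota scheme, and the use of SHA-256 digests as the integrity check are assumptions and are not taken from the described embodiments.

```python
import hashlib


def select_entries(entry_sets, quotas, period):
    """Select quotas[level] entries from each set for the given period.

    Entries are taken in rotation; a quota of zero selects nothing.
    """
    selected = []
    for level, entries in entry_sets.items():
        count = min(quotas.get(level, 0), len(entries))
        if count == 0:
            continue
        start = (period * count) % len(entries)
        selected.extend(entries[(start + i) % len(entries)] for i in range(count))
    return selected


def file_digest(path):
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_entries(paths, baseline):
    """Return the selected paths whose digest differs from the baseline."""
    return [p for p in paths if file_digest(p) != baseline.get(p)]


# Hypothetical classification levels and paths.
entry_sets = {
    "critical": ["/etc/passwd", "/etc/shadow"],
    "routine": ["/opt/app/a.cfg", "/opt/app/b.cfg", "/opt/app/c.cfg", "/opt/app/d.cfg"],
}
quotas = {"critical": 2, "routine": 1}

period0 = select_entries(entry_sets, quotas, period=0)
period1 = select_entries(entry_sets, quotas, period=1)
```

With these quotas, both critical entries are selected in every period, while the routine entries are selected one at a time in turn.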

BRIEF DESCRIPTION OF THE DRAWINGS

Features and advantages of the present invention will become more apparent from the following detailed description of exemplary embodiments thereof taken in conjunction with the accompanying drawings in which:

FIG. 1 is an example of an embodiment of a system described herein;

FIG. 2 is an example of an embodiment of components that may be included in a corporate network of the system of FIG. 1;

FIG. 3 is a more detailed example of an embodiment of components that may be included in an industrial network of the system of FIG. 1;

FIG. 4 is a more detailed example of an embodiment of components that may be included in the watch server of FIG. 3;

FIG. 4A is a more detailed example of an embodiment of the threat thermostat controller;

FIG. 5 is an example of the different types of agents that may be included in an embodiment on systems from FIG. 3;

FIG. 6 is an example of an embodiment of an architecture of each of the agents from FIG. 5;

FIG. 7 is a flowchart of steps of one embodiment for control flow within an agent;

FIG. 8 is an example of an embodiment of the real time database and alarm engine (RTAP) of FIG. 4;

FIG. 9 is an example of a representation of a database schema used by an embodiment of RTAP;

FIG. 9A is an example of representing an alarm function within an attribute with the database schema of FIG. 9;

FIGS. 10-11 are examples of embodiments of an alarm state table that may be used by RTAP;

FIG. 12 is an example of a state transition diagram representing the states and transitions in the alarm state table of FIG. 11; and

FIGS. 13-14 are examples of user interface displays that may be used in an embodiment of the system of FIG. 1.

DETAILED DESCRIPTION OF EMBODIMENT(S)

Referring now to FIG. 1, shown is an example of an embodiment 10 of the system that may be used in connection with techniques described herein. The system 10 may be part of an infrastructure used in connection with, for example, manufacturing, power generation, energy distribution, waste handling, transportation, telecommunications, water treatment, and the like. Included in the system 10 is a corporate network 12 connected through a hub, switch, router and/or firewall 16 to an industrial network 14. The corporate network 12 may be connected to one or more external networks such as the Internet 20 through a firewall 18 and/or other devices. Also connected to the corporate network 12, either directly or via the firewall 18, may be a mail server 30, a web server 32 and/or any one or more other hardware and/or software components.

It should be noted that although the system 10 of FIG. 1 includes a firewall 18 and may also include one or more other firewalls or security measures, the corporate network as well as the industrial network may be susceptible to cyber attacks and other types of security threats, both malicious and accidental. As will be described in following paragraphs, different computer systems that may be included within an embodiment of the industrial network 14 must operate in accordance with an extremely high rate of failsafe performance due to the critical applications and tasks that the industrial network may be used in connection with. In other words, there is a very low tolerance for failure of components included in the industrial network 14. Loss of control and failure within the industrial network 14 may result in much more catastrophic conditions than a failure that may occur within the corporate network 12. For example, a catastrophic failure within the corporate network 12 may force a back-up retrieval of information. However, in the event that the industrial network 14 is being used in connection with supplying power or controlling a system such as train switching, failure may result in a catastrophic loss in terms of both human and economic dimensions.

In connection with the system 10, it should also be noted that external threats such as may be encountered from an external hacker coming through the Internet 20 to access the industrial network 14 may only account for part of the security threats. A large number of cyber attacks and other threats may come from within the system 10 itself such as, for example, within the corporate network 12 or from within the industrial network 14. For example, a disgruntled employee may attempt to perform a malicious attack from within the industrial network 14 as well as within the corporate network 12 in an attempt to cause operation failure of one or more components of the industrial network 14. As another example, someone may connect to the industrial network 14 or the corporate network 12 using a laptop that might be infected, for example, with a form of malicious code such as a Trojan, a virus, a worm, and the like. This malicious code may be introduced within the system 10 on the corporate network 12 or within the industrial network 14 independent of the firewall 18 and/or firewall 16 functioning. Such types of internal threats may not be caught or prevented by the firewall or other security measures developed for preventing primarily external threats. Thus, an embodiment of the system 10 may ideally include and utilize other techniques in connection with controlling, supervising, and securing operation of the components within the system 10 in a failsafe manner.

The corporate network 12 may include components generally used in office and corporate activities such as, for example, systems used by individuals in performing accounting functions, and other administrative tasks. The web server 32 may be used, for example, in servicing requests made to a website associated with the corporate network 12. Incoming e-mail from the Internet 20 to the corporate network 12 may be handled by the mail server 30. It should be noted that an embodiment of the system 10 may include other components than as described herein in accordance with a particular functionality of each embodiment.

The corporate network 12 may be connected to the industrial network 14 through the hub, switch, router, or firewall 16. It should be noted that the corporate network 12 may be connected to the industrial network 14 by one or more of the foregoing mentioned in connection with element 16. In other words, the element 16 in FIG. 1 may represent a layering or hierarchical arrangement of hardware and/or software used in connecting the corporate network 12 to the industrial network 14. The different arrangements of 16 included in an embodiment may vary in accordance with a desired degree of security in accordance with the particular use of the components within the industrial network 14.

Included in the industrial network 14 in this embodiment is a Watch server 50. The Watch server 50 may be characterized as performing a variety of different monitoring, detection, and notification tasks in connection with the industrial network 14 and connection to the corporate network. The Watch server 50 is described in more detail elsewhere herein.

Components included in an embodiment of the system 10 may be connected to each other and to external systems and components using any one or more different types of communication medium(s). The communication mediums may be any one of a variety of networks or other type of communication connections as known to those skilled in the art. The communication medium may be a network connection, bus, and/or other type of data link, such as a hardwire or other connections known in the art. For example, the communication medium may be the Internet, an intranet, network or other non-network connection(s) which facilitate access of data and communication between the different components.

It should be noted that an embodiment may also include as element 16 other types of connectivity-based hardware and/or software as known to those of ordinary skill in the art to connect the two networks, namely the corporate network 12 and the industrial network 14. For example, the element 16 may also be a dial-up modem connection, a connection for wireless access points, and the like.

The different components included in the system 10 of FIG. 1 may all be located at the same physical site, may be located at different physical locations, or some combination thereof. The physical location of one or more of the components may dictate the type of communication medium that may be used in providing connections between the different components. For example, some or all of the connections by which the different components may be connected through a communication medium may pass through other communication devices and/or switching equipment that may exist, such as a phone line, a repeater, a multiplexer, or even a satellite.

Referring now to FIG. 2, shown is an example of an embodiment of components that may be included within a corporate network 12. Included in this embodiment 12 of FIG. 2 are user systems 40 a-40 b, and a hub, switch, firewall, or WAN router 42. The component 42 may be used in connecting this particular corporate network to one or more other corporate networks, to the firewall 18, and also to any other components included in 16 previously described in connection with FIG. 1.

Each of the user systems 40 a-40 b may include any one of a variety of different types of computer systems and components. Generally, in connection with computer systems included within the corporate network 12 as well as in connection with other components described herein, the processors may be any one of a variety of commercially available single or multi-processor systems such as, for example, an Intel-based processor, an IBM mainframe, or other type of processor able to support the incoming traffic and tasks in accordance with each particular embodiment and application. Each of the different components, such as the hub, switch, firewall, and/or router 42, may be any one of a variety of different components which are commercially available and may also be of a proprietary design.

Each of the user systems 40 a-40 b may include one or more data storage devices varying in number and type in accordance with each particular system. For example, a data storage device may include a single device, such as a disk drive, as well as a plurality of devices in a more complex configuration, such as with a storage area network (SAN), and the like. Data may be stored, for example, on magnetic, optical, silicon-based, or non-silicon-based media. The particular arrangement and configuration may vary in accordance with the parameters and requirements associated with each embodiment and system.

Each of the user systems 40 a-40 b, as well as other computer systems described in following paragraphs, may also include one or more I/O devices such as, for example, a keyboard, a mouse, a display device such as a monitor, and the like. Each of these components within a computer system may communicate via any one or more of a variety of different communication connections in accordance with the particular components included therein.

It should be noted that a corporate network may include other components besides user systems such as, for example, a network printer available for use by each user system.

Referring now to FIG. 3, shown is a more detailed example of an embodiment 100 of components previously described in connection with the system 10 of FIG. 1. Included in the industrial network 14 in one embodiment may be a process LAN 102, a control network 104, an I/O network 106, one or more other I/O networks 124 a and 124 b, and a Watch server 50. In this example, the industrial network 14 may be connected to the corporate network 12 by the hub, switch, router, or firewall 16. It should be noted that the industrial network 14 may include other components than as described herein as well as multiple instances of components described herein. In one embodiment, component 16 may be an integrated security appliance such as, for example, the Fortinet Fortigate appliance.

The process LAN 102 may be characterized as performing tasks in connection with data management, integration, display, and the like. The control network 104 may be used in connection with controlling the one or more devices within the I/O network 106 as well as one or more other I/O networks 124 a and 124 b. The Watch server 50 may be characterized as performing a variety of different monitoring, detection, and notification tasks in connection with the industrial network 14 and connection to the corporate network. The Watch server 50 and other components included within an embodiment of 14 described in more detail in the following paragraphs may be used in connection with the operation of the industrial network 14 in order to provide for proper operation of the industrial network 14 and component 16 and security threat management.

The process LAN 102 of FIG. 3 includes a switch or hub 110 a connected to component 16 and one or more other components within the process LAN 102. Components included in this example of the process LAN 102 are a historian 114 and an application server 116. The historian 114 may be used, for example, in storing a history of the different monitoring data that may be gathered by other components included within the network 14. The historian 114, for example, may serve as a data archive for the different types of data gathered over time within the network 14. The application server 116 may be used to execute an application that performs, for example, process optimization using sensor and other data. The application server 116 may communicate results to the SCADA server for use in controlling the operations of the network 14.

The SCADA (Supervisory Control and Data Acquisition) server 118 may be used in remotely monitoring and controlling different components within the control network 104 and the I/O network 106. Note also that the SCADA server included in FIG. 3 generally refers to a control system, such as a distributed control system (DCS). Additionally, the SCADA server 118 may also be responsible for controlling and monitoring components included in other I/O networks 124 a and 124 b. For example, the SCADA server 118 may issue one or more commands to the controller 122 in connection with controlling the devices 130 a-130 n within the I/O network 106. The SCADA server 118 may similarly be used in connection with controlling and monitoring other components within the I/O networks 124 a and 124 b. As known to those of ordinary skill in the art, a SCADA server may be used as part of a large system for remotely monitoring and controlling, for example, different types of energy production, distribution and transmission facilities, transportation systems, and the like. Generally, the SCADA server 118 may be used in connection with controlling and remotely or locally monitoring what may be characterized as components over possibly a large geographically distributed area. The SCADA server may rely on, for example, communication links such as radio, satellite, and telephone lines in connection with communicating with I/O networks 124 a and 124 b as well as I/O network 106. The particular configuration may vary in accordance with each particular application and embodiment.

The workstation 120 may include a human machine interface (HMI), such as a graphical user interface (GUI). The workstation 120 may be used, for example, in connection with obtaining different sensor readings, such as temperature, pressure, and the like, from the devices 130 a-130 n in the I/O network 106, and displaying these readings on the GUI of the workstation 120. The workstation 120 may also be used in connection with accepting one or more user inputs in response, for example, to viewing particular values for different sensor readings. For example, the workstation 120 may be used in connection with an application monitoring a transportation system. An operator may use the GUI of workstation 120 to view certain selected statistics or information about the system. The selections may be made using the GUI of the workstation 120. Other inputs from the workstation 120 may serve as instructions for controlling and monitoring the operation of different devices and components within the industrial network 14 and one or more I/O networks. For example, the transportation system may be used in dispatching and monitoring one or more trains.

The SCADA server 118 may also be used in connection with performing data acquisition of different values obtained by the device sensors 130 a-130 n in performing its monitoring and/or controlling operations. The data may be communicated to the SCADA server 118 as well as the workstation 120. The SCADA server 118, for example, may monitor flow rates and other values obtained from one or more of the different sensors and may produce an alert to an operator in connection with detection of a dangerous condition. The dangerous condition or detection may result in an alarm being generated on the workstation 120, for example, such as may be displayed to a user via the GUI. The SCADA server 118 monitors the physical processing within the industrial network and I/O network(s). The server 118 may, for example, raise alerts to an operator at the workstation 120 when there is a problem detected with the physical plant that may require attention.

The controller 122 may be used in connection with issuing commands to control the different devices, such as 130 a-130 n, as well as converting sensor signal data, for example, into a digital signal from analog data that may be gathered from a particular device. An embodiment may also have a controller 122 perform other functionality than as described herein.

The Watch server 50 may be used in connection with monitoring, detecting, and when appropriate, notifying a user in accordance with particular conditions detected. The Watch server 50 may include a Watch module which is included in an appliance. The Watch server 50 may also be installed as a software module on a conventional computer system with a commercially available operating system, such as Windows or LINUX, or a hardened operating system, such as SE LINUX. In one embodiment, the Watch server 50 may be, for example, a rack mount server-class computer having hardware component redundancy and swappable components. The appliance or conventional computer system may be executing, for example, SE LINUX on an IBM X-series server that monitors the logs and performance of the industrial network 14. The foregoing may be used in connection with monitoring, detecting and notifying a human and/or controlling computer system or other components where appropriate.

It should be noted that the Watch server 50 may be used in raising alerts detected in connection with the SCADA system, associated networks, and computer processors. In other words, the tasks related to monitoring the computers and networks of FIG. 3 are performed by the Watch server 50. In contrast, as known to those of ordinary skill in the art, the tasks related to the physical plant processing, sensor data gathering, and the like for controlling and monitoring the operation of the particular industrial application(s) are performed by the SCADA server 118.

Included in an embodiment of the network 14 are one or more agents 132 a-132 d that may be used in collecting data which is reported to the Watch server 50. It should be noted that each of the agents 132 a-132 d may refer to one or more different agents executing on a computer system to perform data gathering about that computer system. The agents 132 a-132 d report information about the system upon which they are executing to another system, such as the Watch server 50. The different types of agents that may be included in an embodiment, as well as a particular architecture of each of the agents, are described in more detail elsewhere herein. In addition to each of the agents reporting information to the Watch server 50, other data gathering components may include an SNMP component, such as 112 a-112 c, which also interact and report data to the Watch server 50. Each of the SNMP components may be used in gathering data about the different network devices upon which the SNMP component resides. As known to those of ordinary skill in the art, these SNMP components 112 a-112 c may vary in accordance with each particular type of device and may also be supplied by the particular device vendor. In one embodiment, the Watch server 50 may periodically poll each of the SNMP components 112 a-112 c for data.

In one embodiment of the industrial network 14 as described above, the Watch server 50 may be executing the SE LINUX (Security Enhanced LINUX) operating system. Although other operating systems may be used in connection with the techniques described herein, the SE LINUX operating system may be preferred in an embodiment of the Watch server 50 for at least some of the reasons that will now be described. As known to those of ordinary skill in the art, some operating systems may be characterized as based on a concept of discretionary access control (DAC) which provides two categories of user. A first category of user may be an administrator, for example, that has full access to all system resources and a second category of user may be an ordinary user who has full access to the applications and files needed for a job. Examples of operating systems based on the DAC model include, for example, a Windows-based operating system. DAC operating systems do not enforce a system-wide security policy. Protective measures are under the control of each of the individual users. A program run by a user, for example, may inherit all the permissions of that user and is free to modify any file that the user has access to. A more highly secure computer system may include an operating system based on mandatory access control (MAC). MAC provides a means for a central administrator to apply system-wide access policies that are enforced by the operating system. It provides individual security domains that are isolated from each other unless explicit access privileges are specified. The MAC concept provides for a more finely-grained access control to programs, system resources, and files in comparison to the two level DAC system. MAC supports a wide variety of categories of users and confines damage, for example, that flawed or malicious applications may cause to an individual domain. With this difference in security philosophy, MAC may be characterized as representing a best available alternative in order to protect critical systems from both internal and external cyber attacks. One such operating system that is based on the MAC concept or model is the SE LINUX operating system. The SE LINUX operating system is available, for example, at http://www.nsa.gov/selinux.

The components included in the industrial network 14 of FIG. 3, such as the agents 132 a-132 d, SNMP components 112 a-112 c, and the Watch server 50, may be used in connection with providing a real time security event monitoring system. The different agents 132 a-132 d included in the industrial network 14 may be installed on the different computer systems included in the industrial network 14 and may report, for example, on machine health, changes in security log files, application status, and the like. This information gathered by each of the agents 132 a-132 d and SNMP components 112 a-112 c may be communicated to the Watch server 50. This information may be stored in a real-time database also included on the Watch server 50. From the Watch server 50, alarm limits may be set, alerts may be generated, incident reports may be created, and trends may also be displayed. Watch server 50 may also run a network intrusion detection system (NIDS), and has the ability to monitor network equipment, such as switches, routers, and firewalls, via the SNMP components included in various network devices shown in the illustration 100.

The agents 132 a-132 d described herein may be characterized as intelligent agents designed to run and report data while minimizing the use of system and network resources. A control network 104 may have a relatively low amount of network bandwidth. Accordingly, when deploying a monitoring device, such as an agent 132 a-132 d within such a control system, there may be inadequate bandwidth available for reporting the data from each agent. The lower bandwidth of a control network is typical, for example, of older legacy systems upon which the various agents may be deployed. Agents within an embodiment of the network 14 may be designed to minimize the amount of CPU usage, memory usage, and bandwidth consumed in connection with performing the various data gathering and reporting tasks regarding the industrial network 14. In one embodiment, agents may be written in any one or more of a variety of different programming languages such as PERL, C, Java, and the like. Generally, agents may gather different types of information by executing system commands, reading a data file, and the like on the particular system upon which the agents are executing. It should be noted that the agents also consume resources in a bounded manner, minimizing the variance of consumption over time. This is described elsewhere herein in more detail.

Agents 132 a-132 d may format the information into any one of a variety of different message formats, such as XML, and then report this data to the Watch server 50. In one embodiment, the agents 132 a-132 d may communicate the data to the Watch server 50 over TCP/IP by opening a socket communication channel just long enough to send the relevant data at particularly selected points in time. In one embodiment, the agents 132 a-132 d operate at the application level in communicating information to the Watch server 50. The Watch server 50 does not send an application level acknowledgement to such received data. Additionally, the agents never read from the communication channel but rather only send data out on this communication channel as a security measure. It should be noted that although the embodiment described herein uses TCP, the techniques described herein may be used in an embodiment with UDP or another type of connectionless communication.
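By way of illustration only, the send-only reporting described above might be sketched in Python as follows. The element names (`report`, `metric`), the function names, and the timeout value are assumptions made for this sketch and are not taken from the embodiment; the sketch simply opens a TCP connection, writes one XML message, and closes the connection without ever reading from it.

```python
import socket
import xml.etree.ElementTree as ET


def build_report(agent_name, metrics):
    """Serialize one reporting interval's metrics as a small XML message.

    The element and attribute names are hypothetical, chosen for illustration.
    """
    root = ET.Element("report", attrib={"agent": agent_name})
    for name, value in metrics.items():
        metric = ET.SubElement(root, "metric", attrib={"name": name})
        metric.text = str(value)
    return ET.tostring(root, encoding="utf-8")


def send_report(host, port, payload, timeout=5.0):
    """Open a TCP connection just long enough to send the payload.

    The agent never calls recv(): no application level acknowledgement
    is expected, and nothing is read from the channel.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

An agent built along these lines would call `build_report` once per reporting interval and hand the result to `send_report`, so that no connection is held open between intervals.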

As described in the following paragraphs, the agents that may be included in a system 10 of FIG. 1 may be generally characterized as two different classes of monitoring agents. A first class of agent may be used in monitoring control systems upon which the agent actually executes. Agents in this first class are those included in the industrial network 14 of FIG. 3, such as agents 132 a-132 d. A second class of agent that may be included in an embodiment is described elsewhere herein in connection with, for example, the Watch server 50. Agents of this second class may be used in monitoring input from third party equipment or applications or other activity about a system other than the system upon which the agent is executing. As described in more detail elsewhere herein, different types of agents of either class may be used in an embodiment to gather the different types of data.

It should be noted that the various activities performed by the agents described herein may be used in connection with monitoring and reporting, for example, on cyber-security incidents as well as the performance of different system devices such as, for example, the different computer systems for CPU load, space and memory usage, power supply voltage, and the like. Although one particular type of security threat is a cyber-security threat as described herein, it should be noted that the techniques and systems described herein may be used in connection with monitoring security threats that are of different types. For example, control system damage may be caused by a faulty power supply, faulty performance of other components, and the like. The various agents deployed within an embodiment of the system 10 of FIG. 1 may be used in detecting both malicious as well as accidental threats to the industrial network 14. It may also be useful to correlate performance information with security information when assessing the likelihood of system compromise as a result of an attack on one or more systems.

Although the agents 132 a-132 d are illustrated in connection with particular components included in FIG. 3, agents 132 a-132 d may also be used in monitoring a controller 122, devices 130 a-130 n, and the like as needed and possible in accordance with each embodiment. Agents may be used on components having a computer processor capable of performing the tasks described herein and meeting predefined resource limitations that may vary with each embodiment. For example, agents may be executed on “smart” controllers, such as 122, if the controller has a processor able to execute code that performs the agent functionality. There may also be a connection from the controller 122 to the Watch server 50 to communicate the agent gathered data to the Watch server 50.

In FIG. 3, it should be noted that each of 113 a, 113 b, and 113 c may refer to one or more connections and may vary in accordance with the particular component connected to the Watch server 50 by each connection. For example, component 16 may be a hub, switch, router or firewall. If component 16 is a switch, element 113 a may refer to two communication connections between 16 and the Watch server 50 where one of the connections is connected to the spanning port of the switch and is used for the monitoring operations for network intrusion detection by the Watch server 50. In an embodiment in which the component is a hub, a single connection may be used. The foregoing also applies to connections 113 b and 113 c and, respectively, components 10 a and 10 b.

Referring now to FIG. 4, shown is an example of an embodiment of components that may be included in a Watch server 50. Watch server 50 in one embodiment may include a threat agent 200, an SNMP Watch agent 202, an SNMP Guard Agent 203, a NIDS agent 204, an ARPWatch agent 206, and a Guard log agent 209. Data 201 produced by the agents executing in the industrial network may be received by the Watch server. The Watch server 50 itself may be monitored using one or more agents 208 of the first class. Data from these agents is referenced as 208 in FIG. 4. Each of 200, 201, 202, 203, 204, 206, 208 and 209 communicates with the receiver 210 to store data in RTAP (real-time database and alarm engine) 212. The Watch server 50 may also include a web server 214, a notification server 216, a threat thermostat controller 218 and one or more firewall settings 220.

The agents included in the Watch server 50, with the exception of the agents reporting data 208, are of the second class of agent described elsewhere herein in which the agents included in the Watch server gather or report information about one or more systems other than that system upon which the agent is executing. It should be noted that this class of agent is in contrast, for example, to the agents 132 a-132 d previously described in connection with FIG. 3 which are executed on a computer system and report information to the Watch server 50 about the particular computer system upon which the agent is executing. The agents included within the Watch server 50 may be used in gathering data from one or more sources. As described in connection with the agents of the first class of agent, the second class of agents may be written in any one or more programming languages.

The threat agent 200 receives threat assessments from one or more sources which are external to the industrial network 14. The inputs to the component 200 may include, for example, a threat assessment level or alert produced by the corporate network, a security or threat level produced by the US government, such as the Homeland Security Threat level, an input published on a private site on the Internet, and the like.

Agents of both classes executing within the industrial network 14 of FIG. 3 communicate data to the receiver 210 as input source 201. It should be noted that any one or more of the metrics described herein may be reported by any of the agents in connection with a periodic reporting interval, as well as in accordance with the occurrence of certain thresholds or events being detected.

The SNMP Watch agent 202 periodically polls the different devices including hubs, switches, and routers having a vendor supplied SNMP component, such as 112 a and 112 b. The Watch server performs this periodic polling and communicates with each of the SNMP components 112 a-112 b to obtain information regarding the particular component or network device. SNMP components, such as 112 a, report data to agent 202 such as, for example, metrics related to the activity level on switches and the like being monitored. Similarly, the SNMP Guard agent 203 periodically polls the firewall(s) for data. In one embodiment, the component 203 works with the Fortinet Fortigate series of firewalls. The particular firewall(s) supported and utilized may vary in accordance with each embodiment. The components 202 and 203 use the SNMP protocol to request the information set forth below from network switches, firewalls, and other network equipment.

In one embodiment, the following metrics may be reported by 203 at the end of each periodic reporting interval. It should be noted that the units in an embodiment may vary from what is specified herein. Also, not all metrics may be available and tracked by all devices:

Uptime—How long the device has been running continuously since it was last rebooted or reset.

Configuration—Information from the device that describes the device. This may include, for example, the name of the vendor, the model number, the firmware version, and any other software or ruleset versions that the device supports and may be relevant to an understanding of the behavior of the device.

Communications Status—An indication as to whether the device's SNMP component responded to the previous and/or current SNMP requests for information.

Total incoming and outgoing network traffic, in kilobytes, per reporting interval.

Per-interface incoming and outgoing network traffic, in kilobytes, for a reporting interval.

% CPU Load—The percentage of CPU Utilization in the reporting interval.

% Disk space—For devices with disks, like some firewalls, report the percentage used for every filesystem or partition.

% Memory Used—Report the fraction of physical memory in use.

Open Session Count—a count of open communications sessions.

VPN Tunnel count—a count of the number of machines and/or users connected through a firewall device using an IPSEC, SSL or other Virtual Private Network (VPN) technology. For each such connection, also report the source IP address, and if available: host name, user name, certificate name and/or any other information that might serve to identify who or what is connected to the control network via the VPN connection.

Administrative User Count—A count of how many “root” or other administrative users are logged into the device and so are capable of changing the configuration of the device. For each such user, when the information is available via SNMP, report the user name, source IP address and any other information that might help to identify who is logged in.

With reference to the foregoing metrics, in one embodiment, the agent 202 may report at periodic intervals uptime, total incoming and outgoing network traffic, and information regarding the particular operating system, version number and the like.

The NIDS agent 204 monitors the process and control LAN communications by scanning messages or copies of messages passing through LANs, hubs, and switches. It should be noted that referring back to FIG. 3, a dedicated connection between a Watch server and these components may be used in connection with performing NIDS monitoring. The NIDS agent 204 receives data and determines metrics, for example, such as a message count, network intrusion alerts raised, and the like. The NIDS agent 204 may be used in connection with monitoring both the process LAN and the control LAN communications. As an input to the NIDS agent 204, data may come from the NIDS component 204 a. Data from LANs, hubs and switches may be input to the NIDS component 204 a and used in connection with detection of network intrusions. Included within the NIDS component 204 a is a library of signatures that may be used in connection with detecting different types of network intrusions. It should be noted that the NIDS agent 204 may be used in performing real time traffic analysis and packet logging of control networks. The NIDS component 204 a may also be used in connection with performing protocol analysis, content searching and matching, and may be used to detect a variety of attacks and different types of probes including, for example, buffer overflows, stealth port scans, CGI attacks, SMB probes, fingerprinting attempts and the like. The NIDS agent 204 may also be used in connection with monitoring an existing third party network intrusion detection system installed on a control network.

In one embodiment, the NIDS component 204 a is implemented using SNORT technology. As known to those of ordinary skill in the art, SNORT is described, for example, at www.snort.org, and may be used in connection with monitoring network traffic, for example, by monitoring and analyzing all messages on the network, such as through one of the connections connected into a spanning port of a switch used in an embodiment of FIG. 3. SNORT reports data on any messages exchanged between any two ports. SNORT uses a pattern matching engine which searches one or more messages for known patterns. The known patterns are associated with known viruses, worms, and the like. For example, a message packet may include a particular bit pattern indicative of a known worm. The NIDS agent may be implemented using customized and/or conventional NIDS engines. In one embodiment using the SNORT technology, the SNORT open source NIDS engine may be used with the entire SNORT NIDS rules set. An embodiment may also use other NIDS technologies such as, for example, the Fortinet Fortigate NIDS system with the Fortinet Fortigate rules set. An embodiment may also use more than one NIDS system. Specific rules may be disabled or made more specific at particular sites if normal background traffic at the site is found to generate an unacceptable number of false positive alerts as a result of enacting particular rules. This may vary in accordance with each embodiment. An embodiment may connect one or more NIDS engines to the hubs and other components of the industrial network. In the event that the industrial network includes switches, the NIDS engines may be connected to the spanning ports on the switches as described elsewhere herein. If spanning ports are not available, it may be preferable to connect the NIDS engines to monitor the traffic between the industrial network 14 and the corporate network 12, using an Ethernet tap, hub, or other technique as known to those of ordinary skill in the art.

In one embodiment of the NIDS component 204 a, customized signatures may be used in addition to those that may be supplied with the NIDS technology such as publicly available at the Snort.org website. These additional signatures may identify network traffic that would be “normal” on a business network, but is not associated with normal activity within an industrial network. These may include, for example, signatures identifying one or more of the following: telnet login traffic; ftp file transfer traffic; web browsing through normal and encrypted connections; reading email through POP3, IMAP and Exchange protocols; sending email through SMTP and Exchange protocols; and using any of the instant messaging products, such as those from Microsoft, Yahoo, and America Online. The foregoing may be considered normal traffic on a business network, such as the corporate network 12, but not within a dedicated-purpose network like the industrial network 14. It should be noted that plant operators within the network 14 may have access to email and other such facilities, but such access may be gained using workstation computers that, while they may sit physically beside critical equipment, may actually be connected to the corporate network and not to the industrial network.

In one embodiment, the following additional NIDS signatures may be used to report all attempts to communicate with well-known UDP and TCP ports having support which is turned off on correctly-hardened Solaris machines running the Foxboro IA industrial control system software:

Port number (application or service):

-   7 (ECHO)
-   9 (DISCARD)
-   13 (DAYTIME)
-   19 (CHARGEN)
-   21 (FTP)
-   23 (TELNET)
-   25 (SMTP)
-   79 (FINGER)
-   512 (REXEC)
-   513 (RLOGIN)
-   514 (RSH)
-   540 (UUCP)
-   1497 (RFX-LM)

The foregoing are pairings of a port number and an application or service that may typically access this port for communications. In connection with an industrial network, there should typically be no communications for these ports using the TCP or UDP communication protocols by the above service or application. If any such communications are observed, they are flagged and reported. The foregoing pairings are well-known and described, for example, in RFC 1700, Assigned Numbers, section entitled “Well-known Port Numbers”, at http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1700.html.
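As a minimal sketch of the flag-and-report behavior just described, the port pairings listed above may be held in a lookup table and checked against observed traffic. The input shape (tuples of protocol and destination port) and the function name are assumptions made for this illustration; an actual embodiment would express these checks as NIDS signatures rather than as Python.

```python
# Port/service pairings copied from the list above.
HARDENED_OFF_PORTS = {
    7: "ECHO", 9: "DISCARD", 13: "DAYTIME", 19: "CHARGEN", 21: "FTP",
    23: "TELNET", 25: "SMTP", 79: "FINGER", 512: "REXEC", 513: "RLOGIN",
    514: "RSH", 540: "UUCP", 1497: "RFX-LM",
}


def flag_connections(observed):
    """Return an alert tuple for each observed TCP or UDP message whose
    destination port is one that should be turned off.

    `observed` is an iterable of (protocol, destination_port) tuples;
    this input shape is hypothetical.
    """
    alerts = []
    for protocol, dst_port in observed:
        proto = protocol.lower()
        if proto in ("tcp", "udp") and dst_port in HARDENED_OFF_PORTS:
            alerts.append((proto, dst_port, HARDENED_OFF_PORTS[dst_port]))
    return alerts
```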

An embodiment may also include additional signatures representative of traffic that should not be present on an industrial network. The signatures below may be used, for example, to monitor and report on ports used for TCP and/or UDP to identify uses that may be characteristic and typical within the corporate or other business network, but are not desirable for use within an industrial network. This may include, for example, using particular applications and services such as Microsoft's NetMeeting, AOL and other Instant Messaging services, and the like.

For both TCP and UDP, report communications for the following:

Port number (application or service):

-   80 (HTTP)
-   443 (HTTPS)
-   143 (IMAP)
-   993 (IMAPS)
-   110 (POP3)
-   995 (POP3S)
-   989 (FTPS-DATA)
-   990 (FTPS)
-   992 (TELNETS)
-   389 (LDAP-NETMEETING)
-   552 (ULS-NETMEETING)
-   1503 (T.120-NETMEETING)
-   1720 (H.323-NETMEETING)
-   1731 (MSICCP-NETMEETING)
-   1590 (AOL-AOL_IMESSENGER)
-   194 (IRC)

For embodiments using Yahoo Instant Messenger, report activity on the following:

-   TCP ports 5000, 5001, 5050, 5100
-   UDP ports 5000-5010

For embodiments using Microsoft Instant Messenger, report activity on the following:

-   TCP ports 1863, 6901 for application/service: (VOICE)
-   TCP ports 6891-6900 for application/service: (FILE-XFER)
-   UDP port 6901 for application/service: (VOICE)

Note that the particular signatures used may vary with that activity which is typical and/or allowed within each embodiment.
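Because the instant messaging entries above differ by protocol and include port ranges, a simple port-to-service table is not sufficient to represent them. One possible representation, offered only as a sketch, is a list of protocol-specific range rules. The `PortRule` type, the `match_rule` function, and the label strings are hypothetical names introduced for this illustration; the port numbers are those listed above.

```python
from typing import NamedTuple


class PortRule(NamedTuple):
    protocol: str  # "tcp" or "udp"
    low: int       # first port in the range, inclusive
    high: int      # last port in the range, inclusive
    label: str     # hypothetical label used when reporting


IM_RULES = [
    PortRule("tcp", 5000, 5001, "YAHOO-IM"),
    PortRule("tcp", 5050, 5050, "YAHOO-IM"),
    PortRule("tcp", 5100, 5100, "YAHOO-IM"),
    PortRule("udp", 5000, 5010, "YAHOO-IM"),
    PortRule("tcp", 1863, 1863, "MS-IM VOICE"),
    PortRule("tcp", 6901, 6901, "MS-IM VOICE"),
    PortRule("tcp", 6891, 6900, "MS-IM FILE-XFER"),
    PortRule("udp", 6901, 6901, "MS-IM VOICE"),
]


def match_rule(protocol, port, rules=IM_RULES):
    """Return the label of the first rule matching the protocol and port,
    or None if the traffic matches no rule."""
    proto = protocol.lower()
    for rule in rules:
        if rule.protocol == proto and rule.low <= port <= rule.high:
            return rule.label
    return None
```

Keeping the protocol in each rule matters here: TCP port 6895 falls in the file transfer range, whereas UDP traffic on the same port matches nothing in the list.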

The foregoing are examples of different activities that may not be characterized as normal within operation of an industrial network 14. Any additional signatures used may vary in accordance with each embodiment.

In one embodiment of the NIDS component 204 a, customized signatures may be used in an inclusionary manner. Exclusionary NIDS signatures, such as those illustrated above, typically identify and report upon abnormal or undesirable messages detected on the network being monitored. As additional undesirable, abnormal or atypical messages are identified, new and corresponding NIDS signatures are developed for use in the embodiment. An embodiment may use inclusionary NIDS signatures to identify messages that are part of the network's normal, or desirable operating state and report upon any message not identified as such. New inclusionary NIDS signatures are developed in such an embodiment whenever a type of message is determined to be within the normal or acceptable behavior of a given network. In an embodiment using the inclusionary signatures, no additional signatures are needed to identify new undesirable or abnormal messages. Inclusionary signatures therefore incur lower signature update costs than do exclusionary signatures on networks whose normal or desirable network or message traffic patterns change infrequently. On such networks, it can be argued that inclusionary signatures may provide greater security, because they are immediately able to identify new abnormal or undesirable messages, without waiting for exclusionary signatures to be identified, developed or installed.
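The contrast between the two signature styles may be illustrated with a short sketch in which each signature is modeled as a predicate over a message. This modeling, the function names, and the dictionary message shape are assumptions made for illustration and do not reflect how any particular NIDS engine represents its rules.

```python
def exclusionary_alerts(messages, bad_signatures):
    """Report messages that match at least one known-undesirable signature."""
    return [m for m in messages if any(sig(m) for sig in bad_signatures)]


def inclusionary_alerts(messages, good_signatures):
    """Report messages that match none of the known-normal signatures."""
    return [m for m in messages if not any(sig(m) for sig in good_signatures)]


# Hypothetical traffic: port 502 stands in for normal control traffic,
# port 23 for a known-undesirable service, port 4444 for something new.
messages = [{"port": 502}, {"port": 23}, {"port": 4444}]
bad = [lambda m: m["port"] == 23]
good = [lambda m: m["port"] == 502]
```

With these inputs the exclusionary check reports only the port 23 message, while the inclusionary check also reports the port 4444 message even though no signature was ever written for it, which is the property described above.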

In one embodiment, the NIDS agent 204 may have a reporting interval of every 10 seconds. The password age agent 310, described elsewhere herein, may have a reporting interval of every hour. The other agents may have a reporting interval of once every minute, which may vary in accordance with the resources within each embodiment and computer systems therein.

The ARPWatch agent 206 may detect one or more events of interest in connection with the ARP (address resolution protocol) messages monitored. ARP may be used with the internet protocol (IP) and other protocols as described in RFC 826 at http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc0826.html. In one embodiment, the ARPWatch 206 a is based on open source software, as described in ftp://ftp.ee.lbl.gov/arpwatch.tar.gz, with additional functionality described herein to minimize false positives. As known to those of ordinary skill in the art, ARP may be used in performing IP address resolution and binding to a device address. The ARPWatch 206 a may, for example, look for any new devices, such as computers, detected on a network. The ARPWatch 206 a may monitor the industrial network 14 for the introduction of new types of devices and log information regarding these new devices. The ARPWatch agent 206, for example, may use information on new devices provided by 206 a and report information to RTAP in a fashion similar to that used with the NIDS agent using SNORT based technology. For example, a laptop computer may be turned on and connected for a short time period to a network by a wireless connection. When the laptop is introduced into the network, a network address is associated with the laptop while in the network. This may be done by binding an IP address to the device address unique to the particular laptop. When first introduced into the network, there is no known IP address for the laptop. In connection with assigning an IP address to this new laptop, one or more messages may be monitored in the network to detect this event of introducing a new device in the network causing a new binding of an IP address to the unique address of the laptop. The IP addresses associated with the device may be assigned on a visitor or temporary basis, such as with the laptop. Thus, IP addresses may be reused such that a same IP address may be bound to different devices at different points in time.

In one embodiment, a customized version of ARPWatch may raise an alert if the device address of the computer was not previously known or active. An alert may also be raised if the device address was assigned to some other IP address or the IP address had some other device address assigned to it within a prior time period, for example, such as 15-20 minutes. Within an embodiment, the prior time period may vary. It should be noted that the prior time period may be long enough to be confident that the device to which the other IP address was assigned is no longer connected to the network.
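The alert conditions above may be sketched as a small tracker of IP/device address bindings. This is an illustrative interpretation rather than the customized ARPWatch itself: the class name, the 20 minute default window, and the treatment of "not previously known" as "never observed by this tracker" are all assumptions.

```python
import time


class BindingTracker:
    """Track (IP address, device address) bindings and explain why a newly
    observed binding would raise an alert."""

    def __init__(self, window_seconds=20 * 60):
        self.window = window_seconds  # the "prior time period"; assumed default
        self.last_seen = {}           # (ip, mac) -> time last observed

    def observe(self, ip, mac, now=None):
        """Record a binding and return a list of alert reasons (may be empty)."""
        now = time.time() if now is None else now
        reasons = []
        known_macs = {m for (_, m) in self.last_seen}
        if mac not in known_macs:
            reasons.append("new device address")
        for (old_ip, old_mac), seen in self.last_seen.items():
            if now - seen > self.window:
                continue  # binding is old enough to be considered released
            if old_mac == mac and old_ip != ip:
                reasons.append(f"device address recently bound to {old_ip}")
            if old_ip == ip and old_mac != mac:
                reasons.append(f"IP address recently bound to {old_mac}")
        self.last_seen[(ip, mac)] = now
        return reasons
```

Passing `now` explicitly keeps the sketch deterministic for testing; in use, the timestamp would come from the time at which the ARP message was observed.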

Such activity may be a sign of, for example, ARP spoofing or ARP poisoning. As known in the art, ARP spoofing occurs when a forged ARP reply message is sent in response to an original ARP request, and/or forged ARP requests are sent. In the forged replies and/or forged requests, the forger associates a known IP address with an incorrect device address, often the forger's own device address. Thus, receiving the forged ARP messages causes a reassignment of the known IP address to the address of the forger's device, in one or more other devices on the network. This can also cause the network's own table of address bindings to be updated or poisoned with the forged binding. The time period described above may be used to minimize the false positives as may be generated when using the standard open source ARPWatch in IP-address-poor DHCP environments.

The agent 206 in one embodiment uses a version of ARPWatch that issues fewer messages than the conventional open source version for ARP frames containing invalid IP addresses that do not fall within the local network mask. The conventional open source ARPWatch logs a message for every ARP frame containing an invalid IP address, resulting in up to 100 messages per minute for a given invalid IP address detection.
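One way to issue fewer messages, shown here only as a sketch, is to log each invalid IP/device address pair once and suppress repeats. The once-only policy, the class name, and the message text are assumptions; the description above states only that fewer messages are issued, not how the reduction is achieved.

```python
import ipaddress


class InvalidAddressLog:
    """Log an out-of-network IP/device address binding the first time it is
    seen, and stay silent on repeats (an assumed suppression policy)."""

    def __init__(self, local_network):
        # local_network is given in CIDR form, e.g. "192.168.1.0/24"
        self.network = ipaddress.ip_network(local_network)
        self.reported = set()

    def check(self, ip, mac):
        """Return a log message for a newly seen invalid binding, else None."""
        if ipaddress.ip_address(ip) in self.network:
            return None  # address falls within the local network mask
        key = (ip, mac)
        if key in self.reported:
            return None  # already reported; suppress the repeat
        self.reported.add(key)
        return f"invalid IP {ip} for device {mac}"
```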

The ARPWatch 206 a in one embodiment keeps track of the invalid addresses detected and reports invalid IP/device address binding for as long as the ARPWatch 206 a is running. The RTAP 212 may also track changed IP addresses reported by the ARPWatch 206 a via the ARPWatch agent 206, and the web server 214 may then present a list of these to the network administrator for approval such that these addresses are no longer considered as “new.” An embodiment may also provide functionality for approving new and/or changed IP/device address bindings, and/or provide functionality for individually approving new IP addresses and/or new and/or changed IP/device address bindings as well as approving the entire list presented at once. Once the foregoing new device address/IP bindings are approved, the number of “new” network devices drops to zero and any RTAP alert that may be outstanding against that count is reset to a normal condition or state.

In one embodiment, agents of the first class described elsewhere herein may be executing on the Watch server monitoring the health, performance, and security of the Watch server. The data from these agents 208 is sent to the receiver 210 to report data about the Watch server 50. The particular types of this first class of agent are described elsewhere herein, for example, in connection with FIGS. 5 and 6. An embodiment may include any number of these types of agents of the first class to report data about the Watch server 50.

The Guard log agent 209 monitors log files of the Watch server 50 and also the firewall log files of one or more firewalls being monitored, such as an embodiment including a firewall as component 16 of FIG. 3. In one embodiment, an option may be set on the Fortinet Fortigate firewall to automatically transmit all firewall log information to the Watch server 50. On the Watch server 50, the received firewall log information may be included as part of the system logging for the Watch server 50. In one embodiment, the system log files of the Watch server may be segregated into different physical files of the file system. The firewall log files may be included in a separate physical file in the Watch server's log file area. The Guard log agent 209 may also obtain additional information from the firewall using SSH (Secure SHell), with which the agent 209 remotely logs into a machine via a shell. As known to those of ordinary skill in the art, SSH may be characterized as similar in functionality to telnet; however, unlike telnet, all data exchanged is encrypted. In one embodiment, the agent 209 may download a copy of the firewall rules and other configuration information currently in use in a firewall using SSH. Other embodiments may use other secure command and communications technologies such as, for example, IPSEC, sftp, HTTPS, SSL, and the like. The agent 209 may have a master set of firewall configuration information which it expects to be currently in use. The downloaded copy of firewall configuration information currently in use may be compared to the master set to determine any differences. In the event that any differences are detected, an alert may be raised and signaled by RTAP 212.
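The comparison of the downloaded configuration against the master set may be sketched with a line-oriented diff. The sketch assumes, purely for illustration, that both configurations are available as plain text with one rule or setting per line; the function name and the sample rules are hypothetical, and the download step over SSH is not shown.

```python
import difflib


def config_differences(master_text, downloaded_text):
    """Return the lines that were removed from ("- ") or added to ("+ ")
    the master configuration, as found in the downloaded copy.

    An empty list means the configuration in use matches the master set.
    """
    diff = difflib.ndiff(master_text.splitlines(), downloaded_text.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

For example, if the master set contains `allow tcp 502` followed by `deny all` and the downloaded copy has an extra `allow tcp 23` line between them, the function returns a single entry, `+ allow tcp 23`, which could accompany the alert.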

In one embodiment, the following metrics may be reported by the Guard log agent 209 in connection with the firewall log files every reporting interval:

firewall violations—Report the total number of firewall violation messages detected in the reporting interval. On platforms that distinguish different kinds of violation messages, report those counts separately. For example, one embodiment uses ipchains which, as known to those of ordinary skill in the art, is available on some versions of Linux to set up, maintain and inspect the IP firewall rules in the Linux kernel. Ipchains provides for distinguishing “dropped” packets from “rejected” packets, and these counts may be reported out separately.

Report the first three firewall violation messages detected for each type of violation message in each reporting interval.

Report summary information for each type of firewall violation message. The summary information may include, for example, one or more IP addresses identified as the source for most of the messages, what percentage of the messages are associated with each IP address, one or more IP addresses that are the target in a majority of the messages, and what percentage of messages are associated with each target address.

Network IDS (intrusion detection system) and IPS (intrusion prevention system) reports: Report the total number of intrusion detection and prevention reports logged in the reporting period. For systems that report different kinds, priorities or severities of intrusion attempts, report the total number of each class of attempts as separate metrics.

For each separately-reported class of intrusion attempts, report the first three attempts logged in each reporting interval as well as the total number of attempts.

Summary reporting information for each class of intrusion attempt: Report the most common source IP address, destination IP address and attack type, and the percentage of total attempts in that class that had that most common source IP address, destination IP address or attack type.

Firewall configuration change: As described above, the agent 209 may report a boolean value indicating whether any aspect of the industrial network firewall configuration has changed at the end of the reporting interval. The agent 209 uses firewall-specific technologies (e.g., ssh or tftp) to download the currently active firewall configuration to the server 50. If the configuration downloaded at the end of one reporting interval differs from the configuration downloaded at the end of the previous reporting interval, the agent 209 reports a one. Otherwise, if the downloaded configuration differs from the saved firewall settings for the current threat level, the agent 209 reports a one. Otherwise, if any of the saved firewall settings for any threat level have changed in the reporting interval, the agent 209 reports a one. Otherwise, it reports a zero. The agent 209 also reports a one-line summary of what area of the firewall configuration has changed, such as, for example, the saved settings, the downloaded configuration, and the like, with further detail as to what part of the configuration changed, such as, for example, the firewall rules, the number of active ports, address translation rules, and the like. If the configuration has changed, the alert may remain in the elevated alarm state until an authorized administrator, for example, updates the saved configuration data (firewall configuration set) on the Watch server 50 as the new master set of configuration data.

It should be noted that in industrial networks, for example, pathsthrough the firewall may be opened temporarily by a short term firewallconfiguration change such as while short-term projects or contractorsare on-site. The firewall configuration change metric allows forautomatic tracking and determination of when there has been such achange. In response to a one for this metric, for example, RTAP 212 maygenerate an alert condition. This condition may continue to be trackedas long as the configuration is non-standard until the configuration isrestored to a firewall configuration known to be safe.

Threat thermostat configuration change: Reports the number of saved firewall configurations corresponding to threat thermostat threat levels that have changed in the reporting interval.

Note that the agent 209 keeps a copy of the saved firewallconfigurations corresponding to the different threat levels. At the endof the reporting interval, the agent 209 compares the copies to thecurrent firewall configurations corresponding to the different threatlevels and reports if a particular pairing of firewall rule sets with anassociated threat level has changed, or if there has been a change toany one of the rules sets. If there have been any changes, then afterreporting, the agent 209 replaces its set of saved firewallconfigurations with the modified firewall configurations. For everyconfiguration that changed in a reporting period, the agent 209 alsoreports a one-line summary of what has changed in the configuration.

Other activity and innocuous activity filters: In log files of firewalls and other systems, various metrics may be derived from a master system log file. The “other activity” metric is a count of “unusual” messages detected in the master log file during the reporting interval. A set of messages may be defined as unusual using a default and/or user specified set. For example, unusual messages to be included in this metric may be filtered using regular expressions. Any line of the log file that is not counted as some other metric is a candidate for “other activity.” These “other activity” candidate lines are compared to each of the saved regular expressions, which describe innocuous messages, and are discarded if there is a match with any expression. Otherwise, the candidate log entry is counted as an incident of “unusual activity”. When all the log entries generated during the reporting interval have been processed, the “unusual” count is reported as the value of the “other activity” metric.
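
The “other activity” filtering described above may be sketched as follows. This is an illustration only: the two regular expressions stand in for a hypothetical saved set of innocuous-message patterns, and the input is assumed to be the candidate lines left over after all other metrics have been counted.

```python
import re

# Hypothetical saved filters describing innocuous messages.
INNOCUOUS_PATTERNS = [
    re.compile(r"CRON\[\d+\]"),
    re.compile(r"session (opened|closed)"),
]


def count_other_activity(candidate_lines, innocuous=INNOCUOUS_PATTERNS):
    """Count candidate lines that match none of the innocuous patterns."""
    unusual = 0
    for line in candidate_lines:
        if any(pattern.search(line) for pattern in innocuous):
            continue  # matched a filter, so the line is discarded
        unusual += 1
    return unusual
```

The returned count corresponds to the value reported for the “other activity” metric at the end of a reporting interval.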

The receiver 210 is used to interface with RTAP 212. In one embodiment,use of facilities in RTAP 212 is in accordance with a predefinedinterface or API (application programming interface). One of thefunctions of the receiver 210 is to convert the agent protocol datareceived into a format in accordance with a known RTAP API in order topopulate the database of RTAP 212. Additionally, the receiver 210 mayperform agent authentication of messages received. For example, in oneembodiment, a private unique key may be used by each device or processorsending a message to the Watch server 50. The receiver 210 knows theseprivate keys and uses these to authenticate the received messages asbeing from one of the expected devices. The receiver 210 records the IPaddress reporting every metric and rejects new values reported for ametric from any IP address but the last address to report legitimatelyto the metric. Other embodiments may use other encryption techniques andaccordingly may use different techniques in authenticating receivedmessages.
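
One possible way to realize the per-sender key check and the IP address check is sketched below. The text above does not name an authentication algorithm; the use of an HMAC-SHA256 tag, the message layout, and the class and function names are all assumptions made for illustration. The sketch also simplifies the address rule by pinning each metric to the first address that reports it with a valid tag.

```python
import hashlib
import hmac


def sign(key, metric, value):
    """Compute a tag over one metric report (assumed message layout)."""
    message = f"{metric}={value}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class Receiver:
    def __init__(self, agent_keys):
        self.agent_keys = agent_keys   # agent id -> secret key (bytes)
        self.metric_owner = {}         # metric name -> pinned source IP

    def accept(self, agent_id, source_ip, metric, value, tag):
        """Return True only for an authentic report from the pinned address."""
        key = self.agent_keys.get(agent_id)
        if key is None:
            return False  # unknown sender
        if not hmac.compare_digest(sign(key, metric, value), tag):
            return False  # tag does not match the sender's key
        owner = self.metric_owner.setdefault(metric, source_ip)
        return owner == source_ip
```

A report carrying a valid tag but arriving from a different address than the one already associated with the metric is rejected, as is any report whose tag fails verification.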

The notification server 216 and/or the web server 214 may be used inconnection with providing incident notification. When configuring thesecurity or system performance metrics used by the Watch server 50, auser may specify an e-mail address or other destination for receivingdifferent types of alert notifications that may be produced. Thenotification in the event of an alert may be sent, for example, to aPDA, pager, cell phone, and the like upon the occurrence of an eventdetermined by the Watch server. Such an event may include, for example,reported data reaching predetermined alarm or other thresholds,detection of a cyber-attack, detection of a component failure within theindustrial network requiring immediate repair, and the like. Once a userhas been notified upon such a device, the user may then use a webbrowser to gain secure access to the Watch server allowing one toexamine the problem and acknowledge any one or more alarms. Thenotification server 216 may also send a message to a direct phoneconnection, such as to a phone number, rather than an e-mail address.

The web server 214 may be used in connection with displaying informationand/or accepting input from a user in connection with any one of avariety of different tasks in this embodiment. Other embodiments may useconventional command line, Windows, client/server or other userinterfaces known to those of ordinary skill in the art. In thisembodiment, for example, the web server 214, through a web browser, maybe used in displaying a security metric set-up page allowing forcustomization of security conditions, and definitions used for recordingand alarming. For example, a user may specify limit values or thresholdsassociated with different warnings or alarms. When such thresholds havebeen reached, a notification message may be sent to one or morespecified devices or addresses. Additionally, the web server 214 inconnection with a browser may be used, for example, in connection withdisplaying different types of information regarding a security status,details of selected metrics, and the like. In connection with the use ofthe web server 214 and a browser, the different agents may be configuredand monitored.

The web server 214 may be any one of a variety of different types of webservers known to those of ordinary skill in the art such as, forexample, a TOMCAT web server. The web server 214 may be used inconnection with obtaining input and/or output using a GUI with a webbrowser for display, browsing, and the like. The web server 214 and abrowser may be used for local access to appliance data as well as remoteaccess to appliance data, such as the RTAP 212 data. The web server 214may be used in connection with displaying pages to a console in responseto a user selection, in response to a detected alert or alarm, obtainingsettings for different threshold and alarm levels such as may be used inconnection with notifications, and the like. The web server 214 may alsobe used in connection with communicating information to a device such asa pager in the event of a notification when a particular designatedthreshold for example of an alarm level has been reached.

RTAP 212 may provide for collection, management, visualization andintegration of a variety of different automated operations. Data may becollected and reported to the Watch server and stored in RTAP 212, theWatch server's database. As described elsewhere herein, RTAP 212 may beused in connection with performing security monitoring and providing forappropriate notification in accordance with different events that may bemonitored. RTAP may raise alerts, for example, in the event thatpredetermined threshold conditions or events occur in accordance withthe data store maintained by RTAP 212. One embodiment of RTAP isdescribed in following paragraphs in more detail.

RTAP 212 may be implemented using a commercially available real timecontrol system, such as Verano's RTAP product, or the FoxboroIntelligent Automation (IA) product or other SCADA or DCS system. Inother words, a conventional control system may be used not to control aphysical process, but to monitor the security activity of an industrialnetwork and connections. RTAP 212 may also be implemented usingcustomized software that uses a relational database. As will beappreciated by one of ordinary skill in the art, other embodiments mayuse other components.

In operation, each of the different agents may report data to RTAP 212 through use of the receiver 210. RTAP 212 may then store the data, process the data, and perform event detection and notification in accordance with predefined alarm levels and thresholds such as may be obtained from user selection or other defined levels. For example, as described above, a user may make selections in accordance with various alarm or alert levels using a browser with a GUI. These particular values specify a threshold that may be stored and used by RTAP 212. As RTAP 212 receives data reported from the different agents, RTAP 212 may process the data in accordance with the threshold(s) previously specified. In the event that an alarm level has been exceeded or reached, the RTAP 212 may signal an alert or alarm, and provide for a notification message to be sent to one or more devices using the web server 214 and/or notification server 216. It should be noted that the various designated locations or devices to which notification messages are to be sent may also be specified through the same GUI by which the threshold levels are specified.
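
The threshold processing described above may be sketched as follows. The level names, the dictionary layout, and the notify callback are hypothetical; the sketch treats a level as triggered when the reported value reaches or exceeds its limit.

```python
def alarm_level(value, thresholds):
    """Return the name of the highest threshold reached, or None.

    thresholds maps a level name to its numeric limit.
    """
    reached = None
    for name, limit in sorted(thresholds.items(), key=lambda item: item[1]):
        if value >= limit:
            reached = name
    return reached


def process_report(metric, value, thresholds, notify):
    """Evaluate one reported value and send a notification if a level is reached."""
    level = alarm_level(value, thresholds)
    if level is not None:
        notify(f"{level}: {metric}={value}")
    return level
```

Here notify stands in for whatever mechanism delivers the message, such as a call into a notification server.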

The threat thermostat controller 218 may be used in generating aresponse signal in accordance with one or more types of security threatinputs. In one embodiment, the threat thermostat controller 218 may useas inputs any one or more raw or derived parameters from the RTAP 212,other inputs that may be external to the Watch server, and the like. Inone embodiment, in accordance with these various input(s), the threatthermostat controller 218 selects one or more of the firewall settingsfrom 220 which controls access between the corporate network 12 and theindustrial network 14 as well as access to the industrial network 14from other possible connections.

In one embodiment, the threat thermostat controller 218 may use one of three different firewall settings from 220 in accordance with one or more inputs. Each of the firewall settings included in 220 may correspond to one of three different threat levels. In the event that a low threat level is detected, for example, the firewall rule settings corresponding to this condition may allow all traffic between the corporate network 12 and the industrial network 14, as well as other connections into the industrial network 14, to occur. In the event that a medium threat level is determined, a second different set of firewall settings may be selected from 220. These firewall settings may allow, for example, access to the industrial network 14 only from one or more particular designated users or systems within the corporate network 12. If a high threat level is determined by the threat thermostat controller 218, all traffic between the corporate network 12 and industrial network 14 may be denied, as well as any other type of external connection into the industrial network 14. In effect, with a high threat level determination, for example, an embodiment may completely isolate the industrial network 14 from any type of outside computer connection.

Actions taken in response to a threat level indicator produced by thethreat thermostat controller 218 may include physically disconnectingthe industrial network 14 from all other external connections, forexample, in the event of a highest threat level. This may be performedby using a set of corresponding firewall rules disallowing suchconnections. Additionally, a physical response may be taken to ensureisolation of one or more critical networks such as, for example,disconnecting a switch or other network device from its power supply.This may be done in a manual or automated fashion such as using acontrol system to implement RTAP 212. Similarly, a mechanism may be usedto reconnect the critical network as appropriate.

In connection with low threat level determinations, the correspondingfirewall settings from 220 may allow data to be exchanged between theindustrial network and less trusted networks in predefined ways and alsoallow authorized users on less trusted networks to remotely log intocomputers on a critical network, such as the industrial network. Whenthe threat level as generated or determined by the threat thermostatcontroller 218 increases, the second set of firewall rule settings from220 may be used which provide for a more restrictive flow ofcommunication with a critical network such as the industrial network 14.For example, corporate may notify the industrial network that aparticular virus is circulating on the corporate network 12, that aHomeland Security alert status has increased, and the like. Using thesedifferent inputs, the second set of rules may be selected and allowcritical data only to be exchanged with less trusted networks and alsodisable remote log in capabilities. In the event that the highest orthird level of threat is determined by the threat thermostat controller218, what may be characterized as an air gap response may be triggeredleaving all less trusted networks physically disconnected until thethreat(s) have been addressed, such as, for example, by installing anyproper operating system and application patches.

In connection with the threat thermostat 218 in one embodiment, five threat levels may be utilized. Associated with each threat level may be a text file with a series of commands that define a particular firewall configuration including firewall rule sets, what network ports are enabled and disabled, address translation rules, and the like. All of this information may be included in each of the text files associated with each of the different threat levels.

One of the inputs to the threat thermostat controller 218 may include, for example, a security level as published by Homeland Security, an assessment or threat level as produced by a corporate department, and/or another source of a threat level that may be gathered from information such as that available on the Internet through a government agency or other type of private organization and reported by the threat agent 200. These assessments may be weighted and combined by the threat thermostat controller 218 to automatically determine a threat level causing a particular set of firewall settings to be utilized. A particular weighting factor may be associated with each of multiple inputs to 218 in making the determination of a specific indicator or threat level.
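
The weighting and combining of assessments may be illustrated with the sketch below. The combining rule is not specified above; this example assumes each source reports a level on the same numeric scale, takes a weighted average, and rounds up so that the result errs toward the more restrictive level. The source names and weights are hypothetical.

```python
import math


def combined_threat_level(assessments, weights, num_levels=5):
    """Combine per-source threat levels into one level in 1..num_levels.

    assessments maps a source name to its reported level;
    weights maps the same source names to weighting factors.
    """
    total_weight = sum(weights[name] for name in assessments)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(assessments[name] * weights[name] for name in assessments)
    score = weighted_sum / total_weight
    # Round away tiny floating point noise, then round up (assumed policy).
    level = math.ceil(round(score, 9))
    return max(1, min(num_levels, level))
```

For example, a Homeland Security level of 2 with weight 1 and a corporate level of 4 with weight 3 give a weighted average of 3.5, which this sketch maps to level 4.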

It should be noted that the particular firewall settings included ineach of the sets of 220 may include a particular set of firewall rules,address translations, addresses to and from which particularcommunications may or may not be allowed, intrusion detection andprevention signatures, antivirus signatures, and the like. Inputs to thethreat thermostat controller may also include, for example, one or moreraw metrics as provided from RTAP, and/or one or more derived parametersbased on data from RTAP and/or from other sources. It should be notedthat the threat thermostat controller may generate a signal causing datato be displayed on a monitor connected to the Watch server 50 such asthrough a console as well as to send one or more notification messagesto previously designated destinations. In one embodiment, the threatthermostat control level may be displayed on a GUI. In one embodiment,an alert may be generated when there is any type of a change in afirewall rule set or threat level either in an upward or a downwardthreat level direction.

An embodiment may provide for a manual setting of a threat thermostatlevel used in the selection of the firewall settings, and the like. Thismanual setting may be in addition to, or as an alternative to, automatedprocessing that may be performed by the threat thermostat controller 218in determining a threat level. Additionally, an embodiment may includeone or more user inputs in the automatic determination of a threat levelby the threat thermostat controller 218. It should be noted that in oneembodiment, once the threat level has risen out of the lowest level,only human intervention may lower the thermostat or threat level.
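
The behavior in which only human intervention may lower the threat level can be sketched as a simple ratchet. The class and method names are hypothetical, and the numeric scale with 1 as the lowest level is an assumption.

```python
class ThreatThermostat:
    """Holds the current threat level; automated inputs can only raise it."""

    LOWEST = 1

    def __init__(self):
        self.level = self.LOWEST

    def apply_automatic(self, computed_level):
        """Accept an automatically determined level only if it is higher."""
        if computed_level > self.level:
            self.level = computed_level
        return self.level

    def apply_manual(self, level):
        """A human operator may set any level, including a lower one."""
        self.level = level
        return self.level
```

With this arrangement, an automated determination of level 1 following a level 3 leaves the thermostat at 3 until an operator lowers it.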

It should also be noted that although various levels of access with respect to a critical network, such as the industrial network, have been suggested in examples herein in connection with different threat levels, an embodiment may vary the particular access associated with each of the different threat levels. Although three or five threat levels and associated rule sets are described herein, an embodiment may include any number, more or less, of threat levels for use in accordance with a particular application and embodiment.

Additionally, in connection with the data that has been gathered by RTAP 212 such as raw data, alerts may be generated using one or more derived or calculated values in accordance with the raw data gathered by the agents.

An embodiment may implement the database portion of RTAP 212 as anobject oriented database. RTAP 212 may include a calculation engine andan alarm engine in one embodiment. The calculation engine may be used toperform revised data calculations using a spreadsheet-like data flowprocess. The alarm engine may determine an alarm function or level usinga state table. Details of RTAP 212 are described elsewhere herein inmore detail.

It should be noted that any one or more hardware configurations may beused in connection with the components of FIGS. 3 and 4. The particularhardware configuration may vary with each embodiment. For example, itmay be preferred to have all the components of FIGS. 3 and 4 executingon a single computer system in a rack-mount arrangement to minimize theimpact on the physical layout of a plant or other location beingmonitored. There may be instances where physical location and layout ofa system being monitored require use of extra hardware in a particularconfiguration. For example, NIDS and ARPWatch may be monitoring theactivity of 3 different switches in an industrial network using thespanning ports of each switch. Each of the 3 switches may be located inphysical locations not in close proximity to one another or anothercomputer system hosting the components of the Watch server 50. Twoswitches may be located in different control rooms and one switch may belocated in a server room. One hardware configuration is to have thecomputer system upon which the Watch server components execute monitorthe one switch in the server room. Two additional processors may be usedin which each processor hosts agents monitoring execution of one of theremaining two switches. The two additional processors are each locatedin physical proximity near a switch being monitored in the controlrooms. The two additional processors are capable of supporting executionof the agents (such as the NIDS agent 204 and ARPWatch Agent 206) andany software (such as NIDS 204 a, ARPwatch 206 a) used by the agents.These processors are connected to, and communicate with, the computersystem upon which the Watch server components execute. As will beappreciated by those of ordinary skill in the art, the hardware and/orsoftware configurations used may vary in accordance with each embodimentand particular criteria thereof.

In one embodiment, it should be noted that the receiver 210 of the Watch server 50 may track the last time a report was received from each agent (class 1 and class 2). In the event that the component 210 determines that an agent has not reported to the receiver 210 within some predetermined time period, such as within 150% of its expected periodic reporting interval, an alert is raised by sending a notification to one of the notification devices. Such an alert may indicate failure of an agent and/or machine and/or tampering with the watch system and/or with agents. Alerts may also be raised if agents report too frequently, indicating that someone may be trying to mask an attack or otherwise interfere with agent operation. Alerts may also be raised if agent reports are incorrectly authenticated, for example, if they are incorrectly encrypted, have an incorrect checksum, contain an incorrect timestamp or sequence number, are from an incorrect IP address, are of incorrect size, or are flagged as being destined for an IP address other than the address on which the receiver 210 is listening.
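
The check for agents that have stopped reporting may be sketched as follows. The class and method names are hypothetical; the grace factor of 1.5 corresponds to the 150% example above, and times are plain numbers in whatever unit the reporting intervals use.

```python
class ReportTracker:
    """Tracks the last report time per agent and flags overdue agents."""

    def __init__(self, grace_factor=1.5):
        self.grace_factor = grace_factor
        self.expected_interval = {}
        self.last_report = {}

    def register(self, agent_id, expected_interval, now):
        """Start tracking an agent with its expected reporting interval."""
        self.expected_interval[agent_id] = expected_interval
        self.last_report[agent_id] = now

    def record_report(self, agent_id, now):
        """Note that a report arrived from a registered agent."""
        if agent_id not in self.expected_interval:
            raise KeyError(agent_id)
        self.last_report[agent_id] = now

    def overdue_agents(self, now):
        """Return agents silent for longer than grace_factor times their interval."""
        overdue = []
        for agent_id, last in self.last_report.items():
            limit = self.grace_factor * self.expected_interval[agent_id]
            if now - last > limit:
                overdue.append(agent_id)
        return sorted(overdue)
```

Each agent identifier returned by overdue_agents would correspond to an alert sent to one of the notification devices.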

It should be noted that components 202, 203 and 209 may preferably send encrypted communications where possible to other components besides the receiver 210. Whether encryption is used may vary with the functionality of the components communicating. An embodiment may use, for example, V3.0 or greater of the SNMP protocol with the components 202 and 203 in order to obtain support for encryption. Component 209 may also use encryption when communicating with the firewall.

Referring now to FIG. 4A, shown is an example 400 of an embodiment of a threat thermostat controller 218 in more detail. In particular, the example 400 illustrates in further detail the one or more inputs that may be used in connection with a threat thermostat controller 218 as described previously in connection with the Watch server 50 of FIG. 4. An embodiment of the threat thermostat controller 218 may automatically determine a firewall rule set and threat indicator 410 in accordance with one or more inputs 402, 404, 406 and/or 408 and 220. Inputs 402, 404, 406 and 408 may be characterized as selection input which provides for selection of one of the firewall settings from 220. As an output, the threat thermostat controller 218 may automatically send the selected firewall settings from 220 and a threat indicator level as a signal or signals 410. Inputs 402 may come from external data sources with respect to the industrial network 14. The external data may include, for example, an indicator from a corporate network, one or more inputs from an internet site such as in connection with a Homeland Security alert, a threat indicator generated by another commercial or private vendor, and the like. This external data may come from network connections, or other type of remote log in connections with respect to the industrial network 14. Other types of input may include one or more RTAP inputs 404. The RTAP inputs 404 may be raw data inputs as gathered by agents and stored within the RTAP 212 database, particular threshold levels, and the like. RTAP inputs 404 may also include a resultant value or indicator that is generated by processing performed by RTAP in accordance with one or more RTAP data values. An RTAP indicator included as an RTAP input 404 to the threat thermostat controller 218 may be, for example, an indicator as to whether a particular threshold level for one or more metrics is exceeded. The input to the threat thermostat controller 218 may also include one or more derived parameters 406. The derived parameters 406 may be based on one or more raw data values as gathered by the agents and stored in RTAP. These derived values may be stored within RTAP or determined by another source or module. Another input to the threat thermostat controller 218 may be one or more manual inputs 408. The manual input or inputs 408 may include, for example, one or more values that have been selectively input by an operator such as through a GUI or configuration file. These values may include a metric that may be manually input rather than being received from an external source in an automated fashion.

Although the various inputs described and shown in 400 have beenillustrated for use with a threat thermostat controller 218 in oneembodiment, it should be noted that any one or more of these as well asdifferent inputs may be used in connection with the threat thermostatcontroller to produce an output threat indicator. The outputs of thethreat thermostat controller 218 include a firewall rule set and threatindicator 410. The firewall rule set and other settings may becommunicated, for example, to a firewall as a new set of rules to beused for subsequent communications and controlling access to one or morecritical networks. In one embodiment, a new set of firewall rules may beremotely loaded from the Watch server location 220 to the firewall usingSSH (described elsewhere herein) and/or any of a variety of securecommunications mechanisms known to those of ordinary skill in the artsuch as, for example, IPSEC, HTTPS, SSL, and the like.

The threat indicator that may be produced by a threat thermostat controller 218 may also serve as an input to RTAP 212 and may be used, for example, in connection with generating one or more notifications through use of the web server and/or notification server as described elsewhere herein when a particular threat indicator level has increased or decreased, a firewall rule setting selection has been modified, and the like. Additionally, data for the threat level, date, time, and the like may be recorded in RTAP 212. The threat thermostat controller 218 may also produce an output signal 411 used in connection with automatically controlling the operation of connecting/disconnecting the industrial network from the corporate network in accordance with the threat indicator. For example, the signal 411 may be input to RTAP, a control system, switch or other hardware and/or software used to control the power supply enabling connection between the industrial network and corporate network as described elsewhere herein.

It should be noted that in one embodiment, only manual inputs may beused. A single manual input may be used in one embodiment, for example,in selection of a threat indicator causing the threat thermostatcontroller 218 to make a selection of a particular firewall setting.Another embodiment may provide for use of a combination of automatedand/or manual techniques where the automated technique may be used toproduce a threat indicator unless a manual input is specified. In otherwords, rather than weight one or more manual inputs in connection withone or more other inputs in an automated fashion, the manual input orinputs may serve as an override of all of the other inputs in connectionwith selecting a particular firewall rule set from 220 and generating athreat indicator. Such a manual override may be provided as an option inconnection with a mode setting of a threat thermostat controller 218. Ifthe override setting which may be a boolean value is set to on or true,the manual input will act as an override for all other inputs and anautomated technique for producing a threat indicator. In the event thatoverride is set to off, the manual input may not be considered at all,or may also be considered along with other inputs in connection with anautomated technique used by the threat thermostat controller.
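
The override mode described above may be sketched as follows. The function and parameter names are hypothetical. With the override off, this sketch takes the first of the two options mentioned, ignoring the manual input entirely; falling back to the automated level when the override is on but no manual input is present is an additional assumption.

```python
def select_threat_level(automated_level, manual_level=None, override=False):
    """Choose the threat level used to select a firewall rule set."""
    if override and manual_level is not None:
        # The manual input replaces all other inputs.
        return manual_level
    # Override off (or no manual input): use the automated determination.
    return automated_level
```

For example, select_threat_level(2, manual_level=5, override=True) yields the manual value, while the same call with override=False yields the automated value.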

Referring now to FIG. 5, shown is an example 300 of the different typesof agents of the first class of agent that may be utilized in anembodiment of the industrial network 14. It should be noted that theagents 300 may be included and executed on each of the computer systemsin the industrial network 14 as indicated by the agents 132 a-132 d. Inother words, the different agent types included in 300 are those typesof agents that may execute on a system and report information about thatsystem to the Watch server 50. It should be noted that although anembodiment may include the particular agent types of 300, an embodimentmay include different types of agents and a different number of agentsthan as described herein in accordance with the particular applicationand embodiment and may vary for each computer system included in theindustrial network 14.

Included in 300 are a master agent 302, a critical file monitoring agent 304, a log agent 306, a hardware and operating system agent 308, a password age agent 310, and an application specific agent 312. In one embodiment, the master agent 302 is responsible for control of the other agents included in the computer system. For example, the master agent 302 is responsible for starting and monitoring each of the other agents and for ensuring that the other agents are executing. In the event that the master agent 302 detects that one of the other agents is not executing, the master agent 302 is responsible for restarting that particular agent. The master agent 302 may also perform other tasks, such as, for example, scheduling different agents to run at different periods of time, and the like.

The critical file monitoring agent 304 may be used in connection withmonitoring specified data files. Such data files that may be monitoredby agent 304 may include, for example, operating system files,executable files, database files, or other particular data file that maybe of importance in connection with a particular application beingperformed within the industrial network 14. For example, the agent 304may monitor one or more specified data and/or executable files. Theagent 304 may detect particular file operations such as file deletion,creation, modification, and changes to permission, check sum errors, andthe like. Agent 304, and others, gather information and may report thisinformation at various time intervals or in accordance with particularevents to the Watch server 50.

The log agent 306 may be used in monitoring a system log file for aparticular computer system. The log monitoring agent 306 may look forparticular strings in connection with system activity such as, forexample, “BOOT”, or other strings in connection with events that mightoccur within the computer system. The log agent 306 searches the logfile for predetermined strings of interest, and may store in memory thestring found as well as one or more corresponding metrics such as, forexample, the number of occurrences of a string. For example, the logagent 306 may count occurrences of a BOOT string and report the count ina single message which may be sent to the Watch server or appliance. Thesending of a single communication to the Watch server may be performedas an alternative, for example, to sending a message reporting theoccurrence of each string or event. Techniques such as these provide forefficient and bounded use of resources within the industrial network 14resulting in reduced bandwidth and CPU and memory usage consumed by theagents.
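
The counting approach described above, in which a single summary is reported rather than one message per event, may be sketched as follows. The function name is hypothetical, and for simplicity the sketch counts log lines containing each string rather than every occurrence within a line.

```python
def count_strings(log_lines, strings_of_interest):
    """Count the log lines that contain each predetermined string."""
    counts = {text: 0 for text in strings_of_interest}
    for line in log_lines:
        for text in strings_of_interest:
            if text in line:
                counts[text] += 1
    return counts
```

The resulting dictionary, for example a mapping from "BOOT" to its count, could then be sent to the Watch server in one message at the end of the reporting interval.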

In one embodiment, the agent 306 may report the following metrics at periodic intervals:

Login failures—Report the number of “failed login” messages in the system log in the reporting interval. The format of these messages may vary in accordance with the software platform, such as operating system and version, and the login server, such as, for example, ssh, telnet, rlogin, and the like. Reported with this metric may be the names of the top three accounts reporting login failures in a reporting interval, and what percentage of the total number of failure reports is associated with each of these three accounts.

Password change failures—Report the number of “failed password change attempt” messages in the system log in the reporting interval. Some of these failures may be the result of an authorized user trying to change his/her own password. This metric may indicate false positives such as these in addition to indicating a brute force password attack by an unauthorized user. Reported with this metric may be the top three accounts reporting failed password attempts in a reporting interval and a corresponding percentage of failed attempts associated with each account.

Network ARPwatch—Using the modified version of ARPwatch described elsewhere herein, this metric reports the number of unapproved IP/device address bindings currently on the network. The ARPwatch metric also reports the first three ARPwatch log messages detected in each reporting interval, and if the metric is non-zero in an interval, reports the top three IP addresses and device addresses responsible for those messages.

Host IDS audit violations—Report the total number of IDS and failure audit messages detected in the reporting interval. When the IDS classifies the messages, report a count for each classification, e.g., critical, warning. When multiple audit systems are running on a machine, report each system's output independently. For example, on SELinux systems, the SELinux system reports authorization failures for all failed accesses to protected resources. Such authorization failures are reported as a separate SELinux authorization failure metric. Additionally, report the first three log messages detected in each classification in each reporting interval and a count of the messages not reported. This metric may be extended to report a summary of all the messages detected for each kind of message in the reporting interval as well, including, where process information is available, the top three processes responsible for the messages and the percentage of total messages associated with each process and/or, where file information is available, the top three files that are reported as the targets of audited file manipulations, and what percentage of all the IDS messages each file was mentioned in.

Host IDS antivirus alerts—For host IDS systems that monitor and report on viruses detected on control system hardware. Note that while some computers in the industrial network may not execute antivirus software for performance, compatibility, or other reasons, other computers within the industrial network may utilize antivirus software. An embodiment of this agent may also report the first three such messages detected in a reporting interval.

The agent 306 may also include metrics related to the following:

Web page authentication failures, web page permission violations, total web page failures, firewall violations (described herein in more detail), and tape backup failures. This last metric may be useful in connection with notifying administrators, for example, in the event that security history or other information is no longer being backed up on a tape or other backup device.

Windows event logs—Report the number of Windows “Error,” “Warning,” “Informational,” “Success Audit,” and “Failure Audit” log entries detected in the reporting interval. The agent also reports the first 256 characters of text for the first three such log entries discovered for every type of log in every reporting interval.
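
Several of the log metrics above share two summarizing steps: naming the top three accounts with each account's share of the total, and reporting only the first three messages together with a count of the rest. The following is a minimal sketch of both steps. It assumes the account names and messages have already been extracted from the log, which in practice depends on the platform's log format; the function names and sample data are hypothetical.

```python
from collections import Counter


def top_three_with_share(accounts):
    """Return up to three (account, percent_of_total) pairs, most frequent first."""
    total = len(accounts)
    return [
        (name, round(100.0 * n / total, 1))
        for name, n in Counter(accounts).most_common(3)
    ]


def first_three_and_remainder(messages):
    """Return the first three messages and a count of those not reported."""
    return messages[:3], max(0, len(messages) - 3)


# Hypothetical interval: one account name per failed login message.
failed = ["root"] * 4 + ["admin"] * 3 + ["guest"] * 2 + ["bob"]
top = top_three_with_share(failed)
shown, not_reported = first_three_and_remainder(["m1", "m2", "m3", "m4", "m5"])
```

Both helpers return output whose size does not grow with the number of events in the interval.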

The hardware and operating system agent 308 may be used in connection with gathering and reporting information in connection with the operating system and hardware. For example, through execution of status commands or others that may be available in an embodiment, information may be produced using one or more operating system utilities or calls. As an output of a command, data may be produced which is parsed by the hardware and operating system agent 308 for the particular statistics or metrics of interest. In one embodiment, the hardware and operating system agent 308 may use one or more status commands, for example, to obtain information about CPU load, disk space, memory usage, uptime, when the last reboot occurred, hardware and software information such as version numbers, and the like. Similar to the behavior of other agents, the hardware and operating system agent 308 may parse the output of different status commands and send a single report to the Watch server at different points in time rather than report multiple messages to the Watch server. For example, the agent 308 may combine information from multiple status commands and send a single communication to the Watch server or appliance 50 at particular time periods or in accordance with particular events.

In one embodiment, the following metrics may be reported by agent 308 at periodic intervals:

Counts of interprocess communications resources used—System V message counts and segment counts for any message queues used by the control system. If the system is becoming congested, for example because message requests are backing up since the system does not have the resources to process them fast enough, or because of a failure of some component of the system that is supposed to be processing them, an alarm may be raised when any message queue's message or segment count exceeds a preset limit.

Operating System type—This is a constant value reported to assist in auto-configuration of the Watch server. For example, when an administrator is searching for a particular machine when using the GUI on the Watch server, that person may become confused by the many machines whose information is available via the GUI. Operating system type, version number, and the like, may be used in identifying a particular machine through the GUI.

“Uptime”—How long it has been since the machine running the agent 308 was last rebooted. This is sometimes useful in post-mortem analysis. For example, a number of anomalous alerts may be generated by the Watch server in connection with resource and system performance metrics. It may also be observed that a particular computer in the industrial network was being rebooted during this time period. It may be determined that the abnormal resource usage, for example, was due to the machine reboot or restart activity, which may differ from the usage when the computer is in steady state. Thus, it may be useful to determine why the machine rebooted and restarted rather than investigating the individually raised resource-generated alerts.

User count and root user count—How many login sessions are active on the machine, and how many of those belong to “root” or another account with an elevated level of permissions. Each metric reports not just the count of logins, but for the first N logins, such as 20 for example, reports where the user is logged in from, such as the machine's console, or the IP address or host name of some other machine the user is logged in from. In one embodiment, the foregoing metrics may be determined on a Unix-based system. Other systems may provide similar ways to obtain the same or equivalent information. Note that below, as known to those of ordinary skill in the art, “tty” is a UNIX-specific reference to a UNIX device that manages RS232 terminal connections:

1) Determine who is logged into the system. In one embodiment, this may be determined by examining the output of the “who” command (e.g., on HP-UX, “who -uR”; on Tru64, “who -M”).

2) From the output of 1), extract the user name, the source machine if any, and the “tty” or other login user identifier.

3) Determine which user(s) are currently using system resources. This may be determined, for example, using the Unix-based “ps” command.

4) Since “who” command output may sometimes be stale, the number of current users may be determined in a Unix-based system by removing from the “who” list (from 1) any user whose identifier is not associated with any active process identified in 3).

5) When reporting root user counts, an embodiment may also search the /etc/passwd file for each user “who” reports. Any user with a numeric user ID of 0 is in fact running as root and is reported as such. Since a single user logged in on the console may have many terminal windows open and “who” reports each as a separate login, it may be desirable to report the foregoing as a single user.

6) All of the login sessions whose source “who” reported as “:0”, which is associated with the console display device, may be combined into a single entry.

7) Count and report all root users whose “source” is reported by the “who” command as a machine other than the local machine.

8) Examine the “tty” of all potential root users. Increment the root count for, and record the source for, every unique standard input device:

-   “ptyN”—a pseudo tty used by X11 windows
-   “ttyN”—a Linux pseudo console created using the keystroke combination: <Ctrl><Alt><Fn>
-   serial port logins—“ttySN” on Linux or “ttyN” on HP-UX, Solaris and Tru64,
-   “console” for console device logins and “:0” as well for console logins on Tru64.

9) Determine the owner of any of the standard window manager processes that might be running on the machine. If a root console login has not already been identified, and any of these processes is running as root, that counts as a root console login.

10) Determine and count the number of rsh (remote shell) users. Use the Unix “ps” command to identify remote shell daemon processes (e.g., remshd) and their child processes. Count and report as users the user identifiers associated with all such child processes.
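
Steps 4) through 6) above can be sketched as follows. This is an illustration only: it assumes the output of “who”, “ps”, and /etc/passwd has already been parsed into plain Python structures, and the field names (`user`, `tty`, `source`), function names, and sample values are all hypothetical. Merging console sessions per user is one possible reading of step 6).

```python
def live_sessions(who_entries, active_ttys):
    """Step 4: drop stale 'who' entries whose tty has no active process."""
    return [e for e in who_entries if e["tty"] in active_ttys]


def merge_console_sessions(entries):
    """Step 6: keep one entry per user for sessions whose source is ':0'."""
    merged, console_users = [], set()
    for e in entries:
        if e["source"] == ":0":
            if e["user"] in console_users:
                continue
            console_users.add(e["user"])
        merged.append(e)
    return merged


def root_sessions(entries, uid_by_user):
    """Step 5: any account with numeric user ID 0 is counted as root."""
    return [e for e in entries if uid_by_user.get(e["user"]) == 0]


# Hypothetical parsed inputs.
who = [
    {"user": "alice", "tty": "pts/0", "source": ":0"},
    {"user": "alice", "tty": "pts/1", "source": ":0"},
    {"user": "toor", "tty": "pts/2", "source": "10.0.0.5"},
    {"user": "bob", "tty": "pts/3", "source": "10.0.0.9"},  # stale entry
]
active = {"pts/0", "pts/1", "pts/2"}               # ttys seen in "ps" output
uids = {"alice": 1000, "toor": 0, "bob": 1001}     # from /etc/passwd

sessions = merge_console_sessions(live_sessions(who, active))
roots = root_sessions(sessions, uids)
```

In this sample, the stale entry is dropped, the two console windows collapse into one login, and the account with user ID 0 is reported as root even though it is not named “root”.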

CPU Load—On a Windows-based system, report the % CPU used in the reporting interval. On UNIX systems, report the CPU load average for the interval. The load average is the average number of processes competing for the CPU. Note that on a Windows system, if some process is spinning in a tight loop, the machine reports 100% CPU usage. This metric may be characterized as a rather blunt reporting instrument since it gives no idea of how much “other” work the machine is accomplishing over and above the process that is currently looping. On a UNIX system in a similar state, a report may indicate, for example, a load average of 1.5. The process in a loop accounts for 1 load unit (it always wants the CPU). The additional 0.5 indicates that over and above the looping process, ½ of the time some other process wanted the CPU for execution purposes. Additionally, the top three process names consuming CPU in a reporting interval may be reported along with what portion (such as a fraction or percentage) of the CPU used in the interval was used by each of these top three processes.

% Disk space—For every disk partition or volume, report the % used. It should be noted that RTAP may be configured to alert when the percentage of this metric is not in accordance with an absolute threshold.

% Swap space—Report the % used for swap space. An alert may be generated when that metric increases to an absolute threshold. Additionally, in each reporting interval, the top three processes using memory and/or swap space may be reported and what % of the available resource each of these processes consumes.

% Memory Used—Report the fraction of physical memory used. It should be noted that some operating systems report swap and memory as one metric and so the % swap space metric may be the metric reported. In that case % swap space combines memory used and swap used into one total and reports the fraction of that total in use.

Hardware (such as LM) Sensors and Disk Status (such as SMART)—Report metrics from sensors on different computer components, such as the CPU. The values reported on different platforms may differ in accordance with the different hardware and/or software and/or monitoring circuits. Examples of the metrics that may be reported by one or more sensors may include CPU temperature, case temperature, fan speed for any of several fans, power supply working/failed status in machines with multiple power supplies, soft hard disk failures, etc. These sensors can provide early warning of impending hardware failure of critical industrial control computer components. Note that “SMART” refers to the Self Monitoring Analysis and Reporting Technology as described, for example, at http://smartmontools.sourceforge.net. “LM” refers to, for example, “LM-78” and “LM-75” hardware monitoring functionality that is standard in some vendor chipsets and is described, for example, at http://secure.netroedge.com/~lm78.

Network Traffic—Report total incoming and total outgoing traffic for every network interface on the computer. An abnormal increase in such traffic can mean that the machine is under attack, that the machine has been compromised and is being used to attack another machine, that the machine has been compromised and is being used for some purpose other than that for which it was intended, or that there has been some sort of malfunction of the control system on the machine.

Open listen sockets—Reports a count of open listen sockets on the computer. Listen sockets are almost always associated with long-running server processes, and the count of such processes and sockets almost always changes very predictably on control system computers. For example, the count of such sockets may fall within a very small range having little variance. When the count moves out of this range, an alert may be generated. When the listen socket count falls out of the bottom of the range, it may be an indication that some server component of the operating system or of the control system itself has failed, causing the associated listen socket to close. When the listen socket count rises out of the top of the normal range, it may indicate that some server component has been added to the system. This may be, for example, a new control system component, a debugging or other tool that it may or may not be wise to deploy on a production control system, or a component installed by an intruder or authorized user seeking to gain unauthorized access to the system in the future.
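
The range check described for the open listen sockets metric can be sketched as below. The function name, the return labels, and the example bounds are assumptions made for illustration; how the normal range is chosen is not specified here.

```python
def classify_listen_socket_count(count, low, high):
    """Return None when the count is in [low, high], else which side it left."""
    if count < low:
        # A server component may have failed and closed its listen socket.
        return "below_range"
    if count > high:
        # A server component may have been added to the system.
        return "above_range"
    return None


# Hypothetical normal range of 12 to 14 open listen sockets.
results = [classify_listen_socket_count(c, 12, 14) for c in (11, 13, 15)]
```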

The password age agent 310 may be used in monitoring the status of different passwords and accounts. Such activity may include password aging. The particular metrics that may be gathered by the agent 310 may relate to the security of the computer system being monitored as a security measure to detect hackers trying to log in to the system.

In connection with one embodiment, the agent 310 may report the following at periodic intervals:

Max password age—Once an hour, a maximum password age metric may be reported that measures the age of every password on every account on the system. Included in this reported metric may be the age of the oldest password on the system, and, for each of the first 100 users on the system, the user name and age of that user's password. This metric may be useful in raising alerts when passwords become too old. Some sites have a policy of changing passwords regularly, e.g., at minimum every 180 days. Because user names and password ages can serve to help identify vulnerable accounts on a computer system, this information may be separately encrypted when the primary communication channel for the agent report is not already encrypted.
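
A minimal sketch of the maximum password age metric follows. It assumes the date of each account's last password change is already known; how that date is obtained is platform specific and is not shown. The function name, report keys, and sample dates are hypothetical.

```python
from datetime import date


def password_age_report(last_changed, today, max_users=100):
    """Return the oldest password age and per-user ages for the first users."""
    ages = {user: (today - changed).days for user, changed in last_changed.items()}
    return {
        "max_age_days": max(ages.values(), default=0),
        # Only the first max_users accounts are listed, bounding report size.
        "users": list(ages.items())[:max_users],
    }


# Hypothetical accounts and last-change dates.
report = password_age_report(
    {"alice": date(2024, 1, 1), "bob": date(2024, 6, 1)},
    today=date(2024, 7, 1),
)
too_old = [user for user, age in report["users"] if age > 180]
```

Here the 180-day limit is applied by the consumer of the report, mirroring the split in which the agent reports values and alert thresholds are evaluated elsewhere.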

The application specific agent 312 may be customized, for example, to monitor specific application parameters that may vary with a particular embodiment and/or application executing in the industrial network 14. Generally, the application specific agent 312 is an agent that may be built and specified by a particular operator of the industrial network.

In one embodiment, the agent 312 may report information about the particular application at periodic intervals including any of the following metrics:

Abnormal process terminations—A count of control system processes that have terminated unexpectedly or with an improper exit/termination status. The names of a number of such processes which failed in the reporting period may also be reported. These names occupy a fixed maximum size/communications budget, with the next level of detail available in the Web GUI. It should be noted that this metric may include reporting information regarding the first processes in time to terminate unexpectedly rather than the last processes in time to terminate unexpectedly, since the last processes may have terminated as a result of the earliest failed processes. For example, a later unexpectedly terminated process may have terminated execution as a result of being unable to communicate with an earlier unexpectedly terminated process. Also included with this metric when non-zero are the names or other identifiers of the processes that failed most frequently, and what percentage of the total number of failures is associated with each.

Installed software—Reports a count of software packages installed, uninstalled and/or updated in the last reporting period. This information comes from whatever sources are available on the computer such as, for example, one or more log files that are appended to when software is installed, uninstalled or updated, and one or more system databases of installed software that are updated when software is installed, uninstalled or updated.

Another type of application specific agent 312 may be a control system software agent with knowledge of the expected behavior of a specific control system such as, for example, a Foxboro IA system, a Wonderware InSQL server, or a Verano RTAP server (such as the RTAP component 312), and the like. Such agents may report some metrics already described herein such as:

Installed software—When the control system software itself has new components installed, or installed components updated or removed.

Process terminations—Either terminations reported as abnormal by the control system software application itself, or processes no longer active that the agent “knows” should be active because a correctly functioning Foxboro or RTAP or other system should have such processes active to run correctly.

Open listen sockets—The number of open listen sockets. An embodiment may report and monitor the number of open listen sockets that are managed by the control system and are expected to be open for the correct operation of the control system. Note that the number of open listen sockets refers to an embodiment that may use, for example, UDP or TCP. This metric may be more generally characterized as the number of communication endpoints or communication channels open on a server machine upon which the server is listening for client requests.

Control system shutdown—Reports all controlled and uncontrolled shutdowns of the control system application. In the case of an unexpected shutdown of the entire computer running the control system, where there may not have been an opportunity to report the shutdown before the computer itself shuts down, the shutdown may be reported when the computer and the agent restart.

An embodiment may also include other types of agents not illustrated in 300. Some of these may be optionally included in accordance with the amount of resources available as well as the particular task being performed by a system being monitored. For example, an embodiment may also include a type of agent of the first class reporting on file system integrity characteristics, such as changes in file checksum values, permissions, types, and the like. Execution of such an agent may be too CPU and/or disk I/O intensive to scan entire filesystems, or to determine checksums for a large number of files in a system, so this agent may be selectively included in particular embodiments. The file system integrity agent may report the following metric at periodic intervals:

Integrity failures—For IDS that monitor and report on file integrity, report the total number of integrity failure messages detected in the reporting interval. For systems that report different kinds of, or priorities of, integrity failures, an embodiment may report the total integrity failures in each classification. For each classification of failure, also report the first three integrity failure log messages or events detected in the reporting interval and a count of the remaining messages not reported. Integrity failures may be discovered by the agent itself, or the agent may monitor the results of conventional integrity checking tools such as Tripwire or may invoke installation integrity checking tools such as fverify. For more information on Tripwire, see http://www.tripwire.org. For more information on fverify, see: http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V51_HTML/MAN/INDEXES/INDEX_F.HTM

In addition, to provide a degree of filesystem integrity checking on a resource-constrained computer system, an embodiment may pace filesystem integrity checking such that only a small number of files are checked in a given reporting interval to reduce and/or limit the CPU and disk I/O impact of such checking. Such an embodiment may, for example, also classify files into two or more levels of importance, and may scan some number of files from each level of importance in each reporting interval. This way, when the lowest level of importance has many more files in it than the highest level of importance, more important files will tend to be checked more frequently than less important files, while still controlling the resource impact of the scanning.
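
The pacing scheme described above can be sketched as a small scheduler that selects which files to check in each reporting interval. The class name, the round-robin cursor per importance level, and the sample file names are assumptions for illustration; the checksum computation itself is omitted.

```python
class PacedScanner:
    """Pick a fixed number of files per importance level in each interval."""

    def __init__(self, levels, per_interval):
        self.levels = levels              # list of file lists, most important first
        self.per_interval = per_interval  # files taken from each level per interval
        self.cursors = [0] * len(levels)  # round-robin position within each level

    def next_batch(self):
        batch = []
        for i, files in enumerate(self.levels):
            # min() also guards against an empty level.
            for _ in range(min(self.per_interval, len(files))):
                batch.append(files[self.cursors[i]])
                self.cursors[i] = (self.cursors[i] + 1) % len(files)
        return batch


# Hypothetical levels: two critical files, six less important files.
scanner = PacedScanner([["a", "b"], ["c", "d", "e", "f", "g", "h"]], per_interval=1)
batches = [scanner.next_batch() for _ in range(3)]
```

With one file per level per interval, each critical file is revisited every two intervals while each less important file is revisited every six, and the work per interval stays constant.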

It should be noted that the particular metrics reported by each agent as well as other particular agent characteristics, such as periodic reporting intervals, may vary with each embodiment. Although the description of the various metrics herein may be made with reference to elements particular to this embodiment, such as the UNIX-based operating systems, it will be appreciated by one of ordinary skill in the art that equivalents may be used in connection with other elements, such as other operating systems that may be used in an embodiment.

It should be noted that an embodiment may issue an alert using RTAP when any one or more of the metrics described herein, reported as a count or value rather than a boolean, is not in accordance with an absolute threshold.

Data flows from each of the different types of agents of the first class 132 a-132 d on computer systems in the industrial network to the Watch server 50. In this embodiment, the Watch server may be characterized as an appliance that is a passive listening device where data flow is into the appliance. Within the Watch server, a process may be executed which expects a periodic report from the agents 132 a-132 d as well as a report from the agents 132 a-132 d when a particular event occurs. If there is no report from a particular agent 132 a-132 d within a predefined time period, the Watch appliance may detect this and consider the agent on a particular system as down or unavailable. When a particular system, or the particular agent(s) thereon, in the industrial network 14 is detected or determined to be unavailable or offline, as may be determined by the RTAP 212 of FIG. 4, an alarm or alert may be raised. Raising an alarm or alert may cause output to be displayed, for example, on a console of a notification device.
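
The missing-report detection described above can be sketched as follows. The class name, method names, agent identifiers, and the use of plain numeric seconds for time are assumptions for illustration and do not describe how the Watch server is actually implemented.

```python
class AgentWatchdog:
    """Track the last report time per agent and flag agents that went quiet."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def record_report(self, agent_id, now):
        self.last_seen[agent_id] = now

    def unavailable(self, now):
        """Return agents with no report within the predefined time period."""
        return sorted(
            agent for agent, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


# Hypothetical timeline, in seconds.
watchdog = AgentWatchdog(timeout_seconds=120)
watchdog.record_report("132a", now=0)
watchdog.record_report("132b", now=0)
watchdog.record_report("132a", now=100)
silent = watchdog.unavailable(now=130)
```

Because the server only listens, silence is the sole signal that an agent or its host is down, so the timeout would normally be chosen longer than the agents' reporting interval.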

Collectively, the different types of agents provide for gathering data that relates to the health, performance, and security of an industrial network. This information is reported to the Watch appliance or server 50 that uses the health, performance and security data in connection with security threat monitoring, detection, and determination.

Each of the agents may open up its own communication connection, such as a socket, to send data to the Watch server. An embodiment may alternatively use a different design and interaction of the different types of agents than as illustrated in 300. In one embodiment using the example 300, each agent may be implemented as a separate process. In an alternative embodiment, a single process may be used performing the processing of all the different types of agents illustrated in FIG. 5 and all data may be communicated to the Watch server over a single communication connection maintained by this single process. An embodiment may use another configuration to perform the necessary tasks for data gathering described herein.

It should be noted that an embodiment may include the master agent with any one or more of the different types of agents for use with a system being monitored. Using the model of FIG. 5, the master agent is necessary to control the operation of one or more of the other types of agents of the first class.

Referring now to FIG. 6, shown is an example 350 of the architecture of each of the agents of the first and second classes described herein. It should be noted that the architecture 350 may vary in a particular embodiment or with a particular class of agent. The particular illustration of FIG. 6 is only an example and should not be construed as a limitation of the techniques described herein.

An agent data source 352 is input to an input data parser 354. The particular data source 352 may vary in accordance with the particular type of agent. For example, in the event that the agent is a log file agent, the agent data source may be a system log file. In the event that the agent is the hardware and operating system agent, the agent data source may be the output of one or more status commands. The one or more data sources 352 are input to the data parser 354 for parsing. The particular tokens which are parsed by 354 may be passed to the pattern matching module 356 or the metric aggregator and analyzer 358. It should be noted that there are times when the parsed data may be included in a message and does not require use of pattern matching. The pattern matching module 356 searches the data stream produced by 354 for those one or more strings or items of interest. The pattern matching module 356 may report any matches to the metric aggregator and analyzer 358. The component 358 keeps track of a summary of the different strings as well as counts of each particular string that has occurred over a time period, and also performs processing in connection with immediate notification. As described elsewhere herein, an agent may report data to the Watch server 50 at periodic reporting intervals. Additionally, the agent may also report certain events upon immediate detection by the agent. This is described elsewhere herein in more detail. The metric aggregator and analyzer 358 also controls the flow of data between the different components and is also responsible for compressing the messages to minimize the bandwidth used.

Once the metric aggregator and analyzer 358 has determined that a message is to be reported to the Watch server 50, such as for immediate reporting or periodic reporting of aggregated data over a predetermined time period, the metric aggregator and analyzer 358 may send data to the XML data rendering module 362 to form the message. The XML data rendering module 362 puts the information to be sent to the Watch server 50 in the form of an XML message in this particular embodiment. Subsequently, component 362 communicates this XML message to the message authentication and encryption module 360 for encryption prior to sending the XML message to the Watch server or appliance.
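
The rendering step can be illustrated with a short sketch using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`report`, `metric`, `host`, `timestamp`, `name`) are hypothetical; the actual XML schema used between the agents and the Watch server is not given in this description.

```python
import xml.etree.ElementTree as ET


def render_report(host, timestamp, metrics):
    """Render aggregated metrics as one XML message (illustrative schema)."""
    root = ET.Element("report", {"host": host, "timestamp": str(timestamp)})
    for name, value in metrics.items():
        metric = ET.SubElement(root, "metric", {"name": name})
        metric.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical aggregated values for one reporting interval.
xml_message = render_report("hmi-01", 1700000000, {"login_failures": 4, "boot_count": 2})
parsed = ET.fromstring(xml_message)
```

The resulting string would then be handed to the authentication and encryption step before transmission.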

In connection with the message authentication and encryption module 360 of FIG. 6, it should be noted that any one of a variety of different types of encryption techniques may be used. In one embodiment, a timestamp and agent host name or identifier may be included in a message body or text. The authentication processing on the Watch server 50, such as may be performed by the receiver 210, may require that the timestamp values always increase and otherwise reject duplicate or out of date messages. Additionally, an encryption technique may be used which utilizes a key, such as a shared secret key, and the entire message may be encrypted with this key. The shared secret key provides the message authentication information. An embodiment may also use other well-known techniques such as, for example, the MD5 cryptographic checksum and encrypt the checksum of the entire message. The authentication processing performed within the Watch server 50 may vary in accordance with the techniques used by the agents. In one embodiment, an agent may encrypt the checksum of the message and not the message itself. Alternatively, in an embodiment in which a checksum determination of a message is not available, the agent may encrypt the message.
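
The receiver-side checks described above, a keyed authenticator plus strictly increasing timestamps per agent, can be sketched as follows. Note the substitution: this sketch uses HMAC-SHA256 from Python's standard `hmac` and `hashlib` modules as a stand-in for the shared-key encryption of the message or of its MD5 checksum, and it does not encrypt the message body. The class name, function names, field separator, and key are hypothetical.

```python
import hashlib
import hmac


def sign(key, host, timestamp, body):
    """Compute a keyed tag over host, timestamp and body (HMAC-SHA256 stand-in)."""
    message = f"{host}|{timestamp}|{body}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class Receiver:
    """Accept a message only if its tag verifies and its timestamp increased."""

    def __init__(self, key):
        self.key = key
        self.last_timestamp = {}  # per-host timestamp of the last accepted message

    def accept(self, host, timestamp, body, tag):
        expected = sign(self.key, host, timestamp, body)
        if not hmac.compare_digest(expected, tag):
            return False  # wrong key or altered message
        if timestamp <= self.last_timestamp.get(host, float("-inf")):
            return False  # duplicate or out-of-date message
        self.last_timestamp[host] = timestamp
        return True


key = b"hypothetical-shared-secret"
receiver = Receiver(key)
tag = sign(key, "hmi-01", 100, "<report/>")
first = receiver.accept("hmi-01", 100, "<report/>", tag)
replayed = receiver.accept("hmi-01", 100, "<report/>", tag)
tampered = receiver.accept("hmi-01", 101, "<report/>", tag)
```

Including the host name and timestamp in the signed data is what lets the receiver reject a captured message that is replayed later or relabeled with a newer timestamp.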

The different types of data reported by the types of agents of the first class illustrated in FIG. 5 relate to the health, performance, and security of a critical network, such as the industrial network 14. This data as reported to the Watch server 50 enables the Watch server 50 to generate signals or alerts in accordance with the health, performance, and security of the critical network. In particular, the RTAP 212 of the Watch server may be characterized as a global aggregator and monitor of the different types of data reported to a central point, the Watch server 50.

The agents 132 a-132 d (of the first class described elsewhere herein) as well as the second class of agents that communicate data to the Watch server 50 may be characterized as distributed monitoring agents. In one embodiment, these agents may raise alerts or send reports to the Watch server in summary format in accordance with predefined time periods, or in accordance with the detection of certain events in order to conserve the bandwidth within the industrial network 14. In existing systems, agents may report every occurrence of a particular event, such as a suspicious activity, which may result in the consumption of excessive bandwidth when a system is under attack. An agent, such as one of the first class executing in the industrial network 14, may report attack summaries at fixed intervals to conserve network resources. For example, an agent 132 a-132 d may report the occurrence of a first suspicious event and then report a summary at the end of a reporting period. In other words, reports may be sent from an agent at predetermined time intervals. Additionally, the agents may send messages upon the detection or occurrence of certain conditions or events.
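
The pattern of reporting the first suspicious event immediately and then only a summary at the end of the reporting period can be sketched as below. The class name and the dictionary message layout are hypothetical.

```python
class EventSummarizer:
    """Emit the first event of an interval at once, then only a summary."""

    def __init__(self):
        self.count = 0

    def on_event(self, description):
        """Return a message to send now, or None when the event is only counted."""
        self.count += 1
        if self.count == 1:
            return {"type": "immediate", "event": description}
        return None

    def end_of_interval(self):
        """Return the summary for the interval and start a new one."""
        summary = {"type": "summary", "count": self.count}
        self.count = 0
        return summary


summarizer = EventSummarizer()
sent = [summarizer.on_event(f"failed login #{i}") for i in range(1, 1001)]
messages = [m for m in sent if m is not None] + [summarizer.end_of_interval()]
```

In this sample, one thousand events in an interval produce two messages, so the traffic generated does not scale with the intensity of an attack.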

The agents (first class and second class when communicating with the receiver 210) included in an embodiment may be designed in accordance with particular criteria. As described in connection with the above embodiment, the agents are “one-way” communication agents at the application level for increased security so that operation of an agent, such as on a component in the industrial network 14, minimizes added vulnerability to a network attack. The agents communicate with the Watch server by opening a TCP connection, sending an XML document over the connection, and closing the connection after the XML communication is sent. The agents do not read commands or requests for information from the Watch server over this connection.
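
The open, send, close exchange described above can be sketched with Python's standard `socket` module. The function name, the timeout value, and the UTF-8 encoding are assumptions; the host and port are supplied by the caller.

```python
import socket


def send_report(host, port, xml_text, timeout=5.0):
    """Open a TCP connection, send one XML document, and close the connection.

    The function never calls recv(), so nothing sent by the peer at the
    application level is read or acted upon by the agent.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(xml_text.encode("utf-8"))
```

A call such as `send_report("watch.example", 9000, "<report/>")`, with a hypothetical host name and port, would deliver one message. Since the agent exposes no application-level read path, the only inbound traffic it handles is what the TCP stack of its host already processes.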

It should be noted that a computer hosting an agent does receive and process messages from the Watch server. However, the processing performed by such a host of an agent is limited to processing steps at lower network levels. For example, in an embodiment using the XML messages described herein, this processing may include the TCP-level connection setup, teardown and data acknowledgement messages performed at levels lower than the application level. Any vulnerabilities existing at these lower levels exist independent of whether the agents described herein are utilized. In other words, use of the agents described herein does not introduce any additional vulnerabilities into monitored and networked control system equipment.

The agents, in particular the first class of agents described herein, may be characterized as bandwidth limited agents designed to consume a fixed portion of available network resources. Conventional security agents tend to report every anomalous event upon the occurrence of the event, consuming potentially unbounded communication resources under denial-of-service attacks. Conventional agents may regard every security event as important and make a best-effort attempt to communicate every such event to their management console. Agents that consume an excessive amount of a limited network communications resource risk causing the entire system to malfunction, triggering safety relays and other mechanisms to initiate an emergency shutdown of the industrial process.

In contrast, the agents described herein are designed to transmit small fixed-size messages at fixed intervals, thus consuming a bounded portion of available communications resources, even under denial-of-service attack conditions. The first class of agents herein gather information, produce condition reports and event summaries, and report those conditions and summaries at fixed intervals. The reports may include statistics in accordance with the particular first class agent type, such as, for example, resource usage statistics like % CPU used, CPU load factors, % memory used, % file system used, I/O bus utilization, network bandwidth utilization, number of logged in users, and the like. The report may also identify the top N, such as, for example, two or three, consumers of one or more resources. The consumers may be identified by, for example, process names, directories, source IP addresses, and the like, and may identify, when appropriate, what portion of a resource each is consuming. The report may also include other information that may vary with agent type and class such as, for example, counts of log messages and other events, like login failures, network intrusion attempts, firewall violations, and the like detected in the reporting interval that match some criterion or search expression; representative samples of the complete event description or log message for the events counted in the reporting interval; and a short statistical summary of the events, such as what host or IP address hosted the most attacks and what percentage of attacks overall were hosted by a particular computer, which host was most attacked and what percentage of attacks were targeted at a particular host, what user account was most used in connection with launching an attack and what portion of attacks are targeted at a particular user account.
In one embodiment, a reporting threshold for an agent may be specified indicating a maximum amount of data the agent is allowed to transmit during one of the reporting intervals. The reporting threshold may specify, for example, a number of bytes that is equal to or greater than a size of a summary report sent at the reporting interval. For a given reporting interval or period, an agent's reporting budget may be the reporting threshold. The agent may also report one or more other messages as needed besides the summary in accordance with the specified reporting threshold. Prior to sending a report, the agent makes a determination as to whether it is allowed to send a next report by determining if the total amount of data reported by the agent would exceed the reporting threshold by sending the next report. If the threshold would be exceeded, the agent does not send the report.
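The budget check described above may be sketched as follows. This is a minimal illustration under stated assumptions: the class name `ReportingBudget`, its methods, and the `transmit` callback are hypothetical, and the budget is assumed to be counted in bytes and reset at the start of each reporting interval.

```python
class ReportingBudget:
    """Per-interval byte budget for one agent (hypothetical helper)."""

    def __init__(self, threshold_bytes: int):
        self.threshold_bytes = threshold_bytes  # reporting threshold
        self.sent_bytes = 0                     # data sent in this interval

    def start_interval(self) -> None:
        """Reset the running total when a new reporting interval begins."""
        self.sent_bytes = 0

    def try_send(self, message: bytes, transmit) -> bool:
        """Send only if the interval total would stay within the threshold."""
        if self.sent_bytes + len(message) > self.threshold_bytes:
            return False  # sending would exceed the budget, so drop the report
        transmit(message)
        self.sent_bytes += len(message)
        return True
```

For example, with a 10-byte threshold, a 6-byte message is sent, a second 6-byte message in the same interval is refused, and a 4-byte message is still allowed.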

The agents described herein, in particular the first class of agents, are also designed to limit the amount of processing time and storage (disk and memory) consumed. Conventional intrusion detection and performance monitoring agents are known for the negative performance and storage impact on the system being monitored. SNMP components, for example, have been known to consume all of the memory on a host of the SNMP component. AntiVirus scanners may impair the performance of the machines they are monitoring by up to 30-40% depending on the particular processor. The foregoing may not be acceptable in connection with legacy systems, such as may be encountered in industrial networks. Industrial control applications respond to inputs from the industrial process within very short time windows due to their real-time processing nature. Furthermore, such systems render information about the process to operators in a timely manner. Anti-virus solutions, for example, may not generally be deployed on control system hardware, such as in the industrial network 14 described herein, because the anti-virus processing may impair the operation of a system, sometimes causing system failure. The agents described herein are designed to minimize the resource impact on the system being monitored. Expensive metrics, like filesystem checksums, are gathered over a very long period of time, or for only the most security-critical components, so that the impact of the data gathering on the system being monitored is within a small fixed budget. For example, in one embodiment, 1-3% of all of a machine's resources can be allotted to the monitoring agents executing thereon.

An embodiment of RTAP 212 may use an event reporting technique referred to as exponentially decreasing attack reporting. In some embodiments, when a metric goes into an alert state and a user has requested notification of the alert, an e-mail or other notification message is sent indicating that a particular metric has gone into alert state. If the “current value” of the metric, for example, returns to the “normal” range, a notification message may also be sent regarding this transition. The foregoing may cause a large burst of notification messages to be sent to an administrator and important information may be overlooked due to the volume of notification messages received in a short time interval. For example, in the event that the alert or alarm condition exists for some time period, an initial set of notification messages may be sent when an attacker succeeds in compromising one machine. Agents on that machine in the industrial network may report high memory and network usage as a consequence of the machine being compromised and an illicit web server being started. When usage levels return to normal, another set of notification messages may be sent. However, suppose that the memory and network alert conditions do not return to normal. The foregoing conditions may be overlooked in the burst of notification messages. An administrator with access to a web browser could log into the Watch web user interface and see that metrics on a particular host were still in an alert state, but email notification may also be used by administrators who do not have such access. Accordingly, an embodiment may use the “exponentially decreasing notifications” technique which reports the initial alert. Instead of staying silent until the next alert state change, additional alert notices are sent while the metric stays in an alert state. The frequency with which these additional alert notices are sent may vary in accordance with the length of time an alarm condition or state persists.
In an embodiment, this frequency may decrease exponentially, or approximately exponentially. In one embodiment, alert or alarm notification messages may be sent upon the first detection of an alarm or alert condition, and at the end of a first defined reporting interval. At this point, there may be additional summary information that may be optionally sent to the user with the notification message. This is described in more detail in connection with the enhanced email notification described elsewhere herein. Subsequently, a notification message is sent at increasing intervals while the condition persists. These time intervals may be user specified as well as defined using one or more default values that may vary with an embodiment. For example, in one embodiment, an initial reporting interval of an alarm condition may be one minute. After the end of the first reporting interval or minute, notification messages may be sent at time intervals so that a current reporting interval is approximately 10 times longer than the previous reporting time interval. In this example, if the first notification message is sent after 1 minute, the second notification message may be sent after 10 minutes and include any additional information such as may be available using the enhanced e-mail reporting described elsewhere herein. The third notification message may be sent at about 1½ hours later, and so on. The reporting interval may reach a maximum of, for example, 12 hours so that if an alarm or alert state persists, notification messages with enhanced reporting (such as enhanced e-mail) may be sent every 12 hours until the alert condition clears, or the user otherwise removes themselves from the notification list.
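The schedule in the foregoing example can be illustrated with a short sketch. The function name and parameters are hypothetical; the sketch assumes a strict growth factor of 10 and a 12-hour cap, which yields a third interval of 100 minutes, in line with the "about 1½ hours" figure given above.

```python
def notification_intervals(first_minutes: float = 1.0,
                           factor: float = 10.0,
                           max_minutes: float = 12 * 60.0,
                           count: int = 5) -> list:
    """Return the first `count` waits (in minutes) between notifications.

    Each wait is `factor` times the previous one until the cap is reached,
    after which the wait stays at the cap while the alert persists.
    """
    intervals = []
    current = first_minutes
    for _ in range(count):
        intervals.append(min(current, max_minutes))
        current = min(current * factor, max_minutes)
    return intervals


# With the default values, the waits are 1, 10, 100, 720 and 720 minutes.
print(notification_intervals())
```

A notification server using such a schedule would restart it whenever the alert condition clears and is later raised again; that reset behavior is an assumption of this sketch rather than a stated requirement.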

Using the foregoing notification reporting technique, the administrator may be reminded of persistent alert conditions that may otherwise be lost in a burst of notification messages, and may be provided with current summary information so that the administrator can see if the nature of the attack or compromise is changing over time. Using this type of exponentially decreasing attack reporting technique, the bandwidth of the network may be more wisely utilized for the duration of the attack as well. The foregoing exponentially decreasing notification reporting may be performed by the notification server 216 of FIG. 4. The alarm or alert conditions may be produced using the calculation as described elsewhere herein, causing the notification server to be notified. However, the foregoing may be performed by the notification server to reduce the number of times that a notification message is sent.

Additionally, an embodiment may use different techniques in connection with when agents report to the Watch server 50. One design concern, as described elsewhere herein, is minimizing the amount of network bandwidth used for reporting due to the possible bandwidth limitation of the industrial network. In one embodiment, for one or more designated metrics in a defined reporting interval, the log agent 306 may report the first detection of a log message that causes the metric to increment as soon as the log message is detected. Subsequently, the agent does not report any additional information to the Watch server about the metric until the end of the reporting interval, when the agent 306 then reports the total count for the metric in the reporting interval. Using the foregoing, immediate notification may be achieved upon the occurrence of the metric increase and then an update received at the next reporting interval. The foregoing immediate notification may be used with metrics determined using log files. An embodiment may also use other agent types to report other metrics that may be readily determined on an event basis such as, for example, a line being added to a log file, a file descriptor becoming readable, or a command executing.

An embodiment may use a combination of the foregoing immediate notification and periodic interval reporting. In an embodiment using just the periodic interval reporting by the agents to the Watch server 50, there may be an unacceptable delay in reporting alarm conditions indicating an attack or attempted attack. For example, consider an agent's reporting interval of 60 seconds. One second into that interval, the agent observes a failed login attempt indicated by a metric and then another 10,000 such attempts in the minute. A single report is sent at the end of the one-minute reporting interval to the Watch server with a report metric indicating the 10,001 attempts. However, there is a delay, and an administrator may expect to receive a notification of the event before 10,000 further attempts have been detected. Using the immediate notification, the agent also reports the first occurrence of the failed login attempt when it is detected. Accordingly, the Watch server 50 may respond with an immediate notification message upon the first occurrence or detection of the metric increase.
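The combination of immediate and periodic reporting may be sketched as follows. The class `MetricReporter`, the message dictionaries, and the `send` callback are hypothetical names used only for illustration; the sketch assumes one counter per designated metric.

```python
class MetricReporter:
    """Reports the first increment at once, then a total at interval end."""

    def __init__(self, send):
        self.send = send    # callback that delivers a message to the server
        self.counts = {}    # metric name -> count in the current interval

    def record_event(self, metric: str) -> None:
        first = metric not in self.counts
        self.counts[metric] = self.counts.get(metric, 0) + 1
        if first:
            # Immediate notification of the first occurrence only.
            self.send({"type": "immediate", "metric": metric, "count": 1})

    def end_interval(self) -> None:
        # One summary per metric, regardless of how many events occurred.
        for metric, count in self.counts.items():
            self.send({"type": "summary", "metric": metric, "count": count})
        self.counts = {}
```

Applied to the example above, 10,001 failed login attempts in one interval produce exactly two messages: an immediate report of the first attempt and a summary carrying the count 10,001.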

It should be noted that the foregoing immediate notification may be performed in accordance with user selected and/or default specified conditions. This may vary with each embodiment. In one embodiment, the metric aggregator and analyzer 358 may perform processing steps in connection with the immediate reporting and also periodic interval reporting to the Watch server 50.

Referring now to FIG. 7, shown is a flowchart 450 of processing steps describing the control flow previously described in connection with 350 of FIG. 6. Generally, the processing steps of 450 may be performed in an embodiment by each of the agents of the first and second classes when processing the one or more input data sources. At step 452, a determination is made as to whether input data has been received by the agent. If not, control continues to wait at step 452 until input data has been received. It should be noted that the input data may be received in accordance with the agent performing a particular task such as executing a command producing input, waiting for input on a communications channel, reading a data file, and the like, in accordance with one or more predefined criteria. The one or more predefined criteria may include performing a particular task at predefined intervals, when a particular data file reaches a certain level of capacity in accordance with a number of operations, and the like. The particular criteria which cause the input data to be received by the agent may vary in accordance with each embodiment. At step 452 once data has been received, control proceeds to step 454 where the input data is read and then parsed at step 454. Once the input data stream has been parsed, control proceeds to step 455 where a determination is made as to whether pattern matching is needed. If not, control proceeds to step 460. It should be noted that pattern matching may not be needed, for example, if no selective filtering of the parsed input source is needed when all metrics from a source are reported. Otherwise, control proceeds to step 456 where pattern matching is performed. At step 458, a determination is made as to whether the input data has any one or more matches in accordance with predefined string values indicating events of interest. If not, no strings of interest are located and control returns to step 452.
Otherwise, control proceeds to step 460 where data may be recorded for the one or more metrics derived from the parsed input source. For example, a particular metric and its value may be stored and recorded, for example, in the memory of a computer system upon which the agent is executing. At step 462, a determination is made as to whether any messages reporting data to the Watch server are to be sent. As described herein, an agent may report data at periodic intervals when summary information is reported. An embodiment may also provide for immediate reporting the first time a designated metric increases in value such as may be the case, for example, at the beginning of an attack or an attempted attack. This processing may be performed, for example, by the metric aggregator and analyzer 358. If no message is to be sent to the Watch server 50, control proceeds to step 452 to obtain additional data. Otherwise, control proceeds to step 464 where the message to be sent to the Watch server is prepared in accordance with a message protocol and/or encryption technique that may be used in an embodiment. As described herein, for example, a message being sent to the Watch server is sent in an XML or other format and an encryption technique described elsewhere herein may also be used. Control then proceeds to step 466 where the message is sent to the Watch server. Control then returns to step 452 to wait for additional input data to be received by the agent.
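Steps 454 through 460 of the flowchart may be illustrated with the following sketch. The function name, the line-oriented input, the `records` metric used when no filtering is configured, and the use of regular expressions for the pattern matching are assumptions made for illustration only.

```python
import re


def process_input(data: str, patterns, metrics: dict) -> bool:
    """Parse input, optionally pattern-match, and record metric counts.

    Returns True when at least one metric was recorded, in which case the
    caller would continue to the send decision of step 462.
    """
    records = data.splitlines()                      # step 454: read and parse
    if patterns is None:                             # step 455: no filtering
        metrics["records"] = metrics.get("records", 0) + len(records)
        return bool(records)                         # step 460: record all
    found = False
    for record in records:                           # step 456: pattern matching
        for metric, regex in patterns.items():
            if regex.search(record):                 # step 458: any match?
                metrics[metric] = metrics.get(metric, 0) + 1   # step 460
                found = True
    return found
```

A log agent might call this with, for example, a compiled pattern for failed password messages keyed by a metric name such as `failed_logins`.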

Referring now to FIG. 8, shown is an example of an embodiment 212 of components that may be included in RTAP. It should be noted that a version of RTAP is commercially available from Verano Corporation as described, for example, at the website www.verano.com. Included in FIG. 8 is RTAP scheduler 502, an alarm server 504, a Java server 506, and a database server 508. The database server 508 in this embodiment includes a calculation engine 510. The database server 508 may output data, such as the metrics gathered by the agents described herein, to one or more devices 514 which may be stored, for example, on data storage devices such as disks. Included in this embodiment is a memory resident portion of the database 512 used to store designated portions of the data in memory in order to increase efficiency by reducing the amount of time it takes to retrieve data. An embodiment may therefore designate one or more portions of the database to be stored in a memory resident portion 512.

The RTAP scheduler 502 schedules and coordinates the different processes within the RTAP component 212. The RTAP scheduler may perform various process management tasks such as, for example, ensuring that other processes in 212 are executing, scheduling different processing for execution, and the like. The alarm server 504 may be used in connection with interfacing to one or more other components described elsewhere herein for notification purposes. For example, the alarm server 504 may interface with the notification server 216 and the threat thermostat controller of the Watch server 50. The alarm server 504 may be signaled in the event of a detection of a particular alert or alarm condition by the database server 508 and may accordingly interact with components external to RTAP 212. The Java server 506 may be characterized as a bi-directional server communicating with the web server 32 of FIG. 4. The Java server 506 may interface with the web server 32 as needed for notification, message sending, and other communications with RTAP 212. The Java server 506 may also output one or more inputs to the threat thermostat controller 218, and also receive input from the receiver 210 to store data gathered by agents. The database server 508 may be used in connection with storing data either on a data storage device, such as a disk 514, or in the one or more memory resident portions of the database, as may be included within memory 512. In one embodiment, the memory resident portion 512 may be implemented, for example, as a shared memory segment. The data stored in 512 and/or 514 may be an object-oriented database. Prior to use, objects of the database may be designated for inclusion in the memory resident portion 512.

In one embodiment, write operations of the database are made to the database server using the calculation engine 510. Read operations may be performed by having another RTAP component perform the processing rather than reading the data through the use of the database server 508. The RTAP component, such as the Java server, processing a read request for data first consults the memory resident portion 512 and may obtain the one or more other portions of data from disk storage 514 as needed. All write operations in this embodiment are processed through the database server 508 and the calculation engine 510 is used to determine revised data values having dependencies on a modified data value being written. The database server 508 uses an alarm state table 520 in this embodiment to determine alarm conditions in accordance with modified data values stored in the database. The component 520 may be included in the disk or memory resident portion of the database in an embodiment depending on how the database is configured. The shared memory segments of portion 512 may be stored at various time intervals to disk or other non-volatile storage as a back up. Such a time interval may be, for example, every 30 seconds or another time interval selected in accordance with the particular tolerance for data loss in the event that data included in the memory resident portion of the database 512 is lost, for example, in connection with a power or memory failure. It should be noted that in this embodiment, a synchronization technique between readers and writers to the database may be omitted. Data attributes and/or objects which are being written may be synchronized to prevent reading incomplete values. However, the data that is read may also not include all of the recently reported data. Write operations may be synchronized by the database server 508. Thus, the database within RTAP may operate without the additional overhead of using some synchronization techniques.

The database objects may be represented using a tree-like structure. Referring now to FIG. 9, shown is an example of an embodiment 600 of one representation of a database tree or schema that may include the objects of the object oriented database of RTAP. In the representation 600, at level 0 is a root of the tree 600. A security object node and an object template node are children of the root located at level 1. The security object is referred to as the parent node with respect to all of the related metrics and other data stored within RTAP. The object templates may include one or more nodes that correspond to the different templates for each of the different object types. For example, in this embodiment there is a metric name object type, a category name object type, and a host name object type. There are child nodes of the object templates node at level 1 for each of these different types. When a new host is added to the system, an object of the type “host name” is created for that particular new host in accordance with the appropriate object template. Each host name object corresponds to one of the hosts or computer systems. Data associated with each particular host or computer system is organized by category name. A category name may refer to a particular category of metrics. For example, one category may be login information having associated metrics such as number of failed password attempts, and the like. Each of the different metrics associated with a particular category is a child node of a category node corresponding to that particular category. Referring back to the category of login data, for example, the failed password attempts may be one metric stored in a metric object which is a child node with respect to the category name object for login information. This 3-level tree of objects is only one possible embodiment of a database of metrics.
Other embodiments that may also be implemented by one of ordinary skill in the art may include, for example: conventional relational database representations of metrics, as well as other tree structures for object-tree representations of metrics, such as metric objects as children of host objects without intervening category objects, multiple levels of category objects providing additional metric grouping information, and multiple levels of objects above the level of host objects, providing groupings for hosts, such as functional or geographic host groupings.

In this embodiment, there may be one or more object templates. Similarly, there may be one or more different host objects. Associated with a single host object may be one or more category objects. Associated with a single category object may be one or more metric objects. Each of the objects shown in 600 may also have an associated one or more attributes. For sake of simplicity of illustration, all the attribute nodes of each object are not included in the tree 600. As an example, object 602 is shown in more detail in FIG. 9.

In connection with the object names included in the database schema or tree of objects, a particular metric may be referred to by including the name of all of the intervening objects in the path between the root and the particular node of interest.
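As a sketch of this naming scheme, the tree may be modeled with nested dictionaries and a metric addressed by joining the names along its path. The `/` separator, the function names, and the example object names (`security`, `host1`, `login`, `failed_password_attempts`) are hypothetical and chosen only for illustration.

```python
def metric_path(*names: str, sep: str = "/") -> str:
    """Build a metric reference from the object names below the root."""
    return sep.join(names)


def lookup(tree: dict, path: str, sep: str = "/"):
    """Walk the tree one object name at a time and return the node found."""
    node = tree
    for name in path.split(sep):
        node = node[name]   # raises KeyError if an intervening object is missing
    return node


# Security object -> host -> category -> metric, as in the 3-level example.
tree = {"security": {"host1": {"login": {"failed_password_attempts": 7}}}}
path = metric_path("security", "host1", "login", "failed_password_attempts")
print(path, lookup(tree, path))
```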

Included in FIG. 9 is a support data object having one or more child objects which may be used to store information used primarily to configure standard components of the RTAP system. What will now be described is how the alarm state tables used by the calculation engine, as described elsewhere herein, may be stored within the representation 600. In this embodiment, one child node of the support data object is an alarm class object. The children of the alarm class object correspond to the different types of alarm classes. In this example, there are three alarm classes: Boolean or 2-state, 3-state, and 5-state. For each class, such as the Boolean class 606, a child node, such as 608, includes the alarm state table for that class. In this example, the number of states in an alarm class specifies the number of bins or alarm levels. An embodiment may include other alarm classes than as shown herein.

A host object may be created, for example, the first time a new agent reports to the Watch server. A new host may be determined by the receiver 210 of FIG. 4. On first notification of a new agent, an alert or alarm condition is raised. A notification message may be sent. A user or administrator may be presented with a list of one or more new agents and a selection may be made to authorize or reject each new agent. When a new agent is authorized/approved, data from the agent may be processed by the receiver 210. Otherwise, data from an unauthorized/unapproved agent is rejected. Note that the data from the agent is not stored or queued for later processing after approval since this may cause an overflow condition. An agent reports one or more metrics or other data to the receiver 210 which, upon successful authentication of the message, may perform a write operation to the database of RTAP. The RTAP database server 508 then writes the values to the objects and executes or runs the calculation engine in order to propagate all other values dependent on this new value written to the database. For example, a particular metric, such as the number of failed password attempts, may be referenced as a metric attribute. The first metric attribute may be the number of failed password attempts as a raw value. A second metric attribute may be the raw value used in a mathematical representation to calculate a percentage. Accordingly, when a new metric raw value of failed password attempts is written causing an update for the value of the first attribute, the calculation engine is then executed and updates any other values in the database dependent on this new raw value. In this instance, a revised percentage is determined as a revised second attribute value.
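The propagation of dependent values on a write may be illustrated with the following sketch. It is not the RTAP calculation engine; the class `CalcEngine`, its methods, and the attribute names are hypothetical, and the sketch assumes the dependencies between attributes contain no cycles.

```python
class CalcEngine:
    """Recomputes derived attributes whenever an input attribute is written."""

    def __init__(self):
        self.values = {}     # attribute name -> current value
        self.formulas = {}   # derived attribute -> (function, input names)

    def define(self, name, func, inputs):
        """Declare a derived attribute computed from other attributes."""
        self.formulas[name] = (func, list(inputs))

    def write(self, name, value):
        """Store a new value, then update everything that depends on it."""
        self.values[name] = value
        self._propagate(name)

    def _propagate(self, changed):
        for name, (func, inputs) in self.formulas.items():
            if changed in inputs and all(i in self.values for i in inputs):
                self.values[name] = func(*(self.values[i] for i in inputs))
                self._propagate(name)   # assumes an acyclic dependency graph


engine = CalcEngine()
# Second attribute: failed attempts as a percentage of all login attempts.
engine.define("failed_pct", lambda failed, total: 100.0 * failed / total,
              ["failed_attempts", "total_attempts"])
engine.write("total_attempts", 200)
engine.write("failed_attempts", 50)   # triggers recalculation of failed_pct
```

After the second write, `engine.values["failed_pct"]` holds the revised percentage without the agent having reported it.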

In this embodiment, the calculation engine 510 has a built-in alarm function which uses values from the alarm state table 520 to determine if a revised data value triggers an alarm condition. After writing a new data value to the database, the calculation engine determines revised data values as described above in accordance with data dependencies. Once the revised data values have been determined, alarm state table 520 may be consulted to determine if any of the revised values now trigger a new or revised alarm condition. In the event that the calculation engine determines that an alarm condition has been detected, a message is sent from the database server 508 to the alarm server 504 which subsequently sends a message to the one or more notification servers.

In one embodiment, an alarm state vector or alarm instance may be defined for an attribute of a metric object. In determining a value for this attribute, the alarm function described above may be invoked.

Referring now to FIG. 9A, shown is an example 620 of more detail of a node in the tree 600 for a metric using the alarm function. In this example, the attribute 1 622 has an associated alarm instance 624 and an alarm function whose result is assigned as the value of the attribute 1 622. The alarm instance 624 includes one or more subvalues 628 a-628 c that may be used by the alarm function. In this example, the subvalues include a current alarm state 628 a, a current acknowledged state (Ack state) 628 b, and a sequence number 628 c. It should be noted that other information may be included as one or more subvalues than as described herein in this example. Use of these subvalues 628 a-628 c is described in more detail in following paragraphs. In one embodiment, the subvalues may be included in a vector or other data structure.

In one embodiment, the alarm function may have a defined API of the following format:

Alarm (alarmclass, value, limits, other data . . . )

The limits in the above refer to the alarm limits vector, as described elsewhere herein. The alarm limits vector may include one or more levels associated with the different thresholds. Each level in the alarm limits vector above refers to an alarm level or threshold that may be associated with varying degrees of alarm conditions or severity levels such as, for example, warning, high, and the like. Each of these levels may be stored within another attribute, such as attribute 2 of FIG. 9A and may have a default value as specified in the original template. These values may be changed in an embodiment, for example, through a user specifying or selecting a new threshold level. The alarmclass may be used to specify a particular class of alarm (such as 2-, 3-, or 5-state) in order to determine the proper alarm class from the tree 600 to obtain the corresponding alarm state table for a particular alarm class.

It should be noted that state tables, as may be used in connection with alarm states, are known to those of ordinary skill in the art and may include one or more rows of input. Each row may specify a next state and action(s) based on a current state and given input(s). Using the alarm state table as input, the built in alarm function of the calculation engine may determine, in accordance with the revised data values, whether a predefined level associated with an alarm condition has been exceeded. The predefined levels may be default or user-specified alarm thresholds.
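The comparison of a revised value against the alarm limits vector may be sketched as follows. This is an illustration only and not the RTAP alarm function: the class names, the rule that an N-state class takes N-1 limits, and the 0-based level numbering (0 meaning no limit exceeded) are all assumptions of the sketch, and the state-table lookup itself is omitted.

```python
from bisect import bisect_left

# Hypothetical mapping from alarm class name to its number of states.
ALARM_CLASS_STATES = {"boolean": 2, "3-state": 3, "5-state": 5}


def alarm(alarmclass: str, value: float, limits: list) -> int:
    """Return how many limits `value` exceeds (0 means within normal range)."""
    states = ALARM_CLASS_STATES[alarmclass]
    if len(limits) != states - 1:
        raise ValueError(f"{alarmclass} expects {states - 1} limits")
    # bisect_left counts the limits strictly below `value`, so a value equal
    # to a limit is not treated as exceeding it.
    return bisect_left(sorted(limits), value)
```

For instance, with a 3-state class and limits of 100 and 200, a value of 100 yields level 0, a value of 101 yields level 1, and a value of 201 yields level 2.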

The foregoing is an example of how data gathered by an agent may be processed and stored within an embodiment of the Watch server including RTAP. Other embodiments may use other representations in accordance with the particular metrics and communication protocols used in an embodiment.

The RTAP component 212 may be characterized as described herein in one embodiment as an environment which is a set of cooperating processes that share a common communications infrastructure. In one embodiment, these processes may communicate using SysV UNIX messaging techniques, semaphores, shared messages, and the like. As known to those of ordinary skill in the art, an embodiment using SysV messaging techniques may experience problems due to insufficient memory allocated for message use, such as with RTAP communications. Messages that may be communicated between processes within RTAP, as well as between RTAP and other components, may use a prioritization scheme in which low priority activities involving message sending are suspended when the amount of memory in a message pool falls below a designated threshold. This particular designated threshold may vary in accordance with each particular embodiment. In connection with the techniques described herein, a portion of the memory for message use, such as 80%, may be designated as a maximum threshold for use in connection with requests. In the event that more than 80% of the message memory or pool has been consumed and used in connection with message requests, any new requests are blocked until conditions allow this threshold not to be exceeded. However, message responses are processed. The foregoing may be used to avoid a deadlock condition by blocking a request in the event that the threshold portion of the message pool is consumed for use in connection with requests. In one embodiment, the foregoing special message management functionality may be included in one or more routines or functions of an API layer used by the different RTAP components when performing any messaging operation or function. These routines in the API layer may then invoke the underlying messaging routines that may be included in the operating system.
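The request threshold described above may be sketched as a simple admission check. The class `MessagePool` and its methods are hypothetical, and the sketch assumes one reading of the text: requests may occupy at most 80% of the pool, while responses may use any remaining capacity, so that a response needed to complete an outstanding request is not starved by new requests.

```python
class MessagePool:
    """Tracks message memory and caps the share available to requests."""

    def __init__(self, capacity: int, request_percent: int = 80):
        self.capacity = capacity
        self.request_limit = capacity * request_percent // 100
        self.used = 0            # bytes held by all messages
        self.request_used = 0    # bytes held by request messages only

    def try_allocate(self, size: int, is_request: bool) -> bool:
        """Reserve space for a message, or return False if it must wait."""
        if self.used + size > self.capacity:
            return False
        if is_request and self.request_used + size > self.request_limit:
            return False         # new requests are blocked above the threshold
        self.used += size
        if is_request:
            self.request_used += size
        return True

    def release(self, size: int, is_request: bool) -> None:
        """Return space to the pool once a message has been consumed."""
        self.used -= size
        if is_request:
            self.request_used -= size
```

A wrapper of this kind would sit in the API layer and call the underlying operating system messaging routine only when `try_allocate` succeeds.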

An embodiment may utilize what is referred to herein as “latching alerts” where a particular alarm level does not decrease until an acknowledgment of the current alarm state has been received. An acknowledgment may be made, for example, by an operator through a GUI. An embodiment may define an alarm state table 520 such that an alarm or an alert state may be raised or remain the same until an acknowledgement of the alarm or alert state has been received. Until an acknowledgment is received, the alarm state table does not provide for reducing the alarm or alert state. It should be noted that the foregoing latching alerts may be performed in connection with one or more of those metrics associated with an alert or an alarm state. The latching alerts may be used in an embodiment in connection with particular indicators or classes. Other classes of metrics, such as those associated with performance indicators, may not be subject to the latching condition. This may vary in accordance with each embodiment.

Referring now to FIG. 10, shown is an example of an embodiment of the alarm state table 520 that may be used in connection with implementing latching alerts. Included in the state table 520 is a line of information corresponding to a current level or state. Each line of information includes a current level or state, an acknowledgment, an input value or range, a next level or state, and an associated action. In this example 520, a normal level is associated with a level one indicator for a range of input values between 0 and 100, inclusively. An alarm condition is associated with a second level for a range of input values between 101 and 200, inclusively. Finally, a third alarm level is associated with an input range of values from 201 to 300, inclusively. Line 652 indicates that the current level or state of normal level one is maintained as long as the input is in the range of 0 to 100. Line 654 indicates that when the current level is normal (level 1) and a current input value is in the range of 101 to 200, level 2 is the next designated level or alarm condition. The action designates that an alarm condition is generated. In connection with lines 652 and 654, it is not relevant whether an acknowledgment has been received because an acknowledgment does not apply in this example in connection with a normal alarm or level condition. Referring to line 656, when the system is in the second level of alarm and the input value drops down to the normal range between 0 and 100, but an acknowledgement of the alarm condition has not yet been received with respect to the level 2 alarm condition, the system remains in the level 2 state of alarm. With reference to line 658, in the event that the system is in the second level of alarm and acknowledgement of this alarm has been received, when the input value or range drops to the normal range between 0 and 100, the level of the current state decreases to level 1 and the alarm condition is cleared. Referring to line 660, if the system is in the second level of alarm state and the input value increases to the third level between 201 and 300, an alarm condition is generated to signify this increase in alarm level to level 3. This is performed independent of whether an acknowledgement of the previous alarm state of level 2 has been received.
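The latching behavior of lines 652 through 660 may be summarized with the following minimal Python sketch. The function names and the action strings are illustrative assumptions. The text only describes transitions out of levels 1 and 2, so the handling of a drop from level 3 is assumed here to follow the same latching rule.

```python
def level_for(value):
    """Map an input value to the level whose range contains it."""
    if 0 <= value <= 100:
        return 1
    if 101 <= value <= 200:
        return 2
    if 201 <= value <= 300:
        return 3
    raise ValueError("input outside the ranges of the example table")


def next_level(current, acked, value):
    """Return (next level, action) for one evaluation of the table."""
    target = level_for(value)
    if target > current:
        # An increase is applied whether or not the prior alarm was acknowledged.
        return target, "alarm generated"
    if target < current:
        if not acked:
            return current, "latched"  # no decrease before acknowledgement
        action = "alarm cleared" if target == 1 else "alarm lowered"
        return target, action
    return current, "no change"
```

For instance, `next_level(2, False, 50)` keeps the alarm at level 2, while `next_level(2, True, 50)` returns to level 1 and clears the alarm.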

It should be noted that the different ranges or values specified in connection with the third column of 520 may be associated with threshold values or ranges. The thresholds may be specified using default values as well as in accordance with one or more user selected values or ranges. It should also be noted that although the table 520 shows specific numeric values for the ranges in the input column, these alarm range thresholds may be parameterized to use the input values (i.e., alarm LEVELs) of the alarm function described elsewhere herein.

Referring now to FIG. 11, shown is an example of another embodiment of an alarm state table 700 and an alarm limits vector 720. The elements of vector 720 identified in the lower left corner may be passed as input parameters when invoking the alarm function described herein specifying the LEVELs or thresholds for the different alarm states as described above. In this example, the table 700 represents a 3-state alarm (normal, warning, and alert) with 2 thresholds (100 and 200) forming three partitions or ranges (0-99, 100-199, 200 and greater). Each row of the table 700 corresponds to a state and includes:

a state identifier for the row in the table 704;

a named identifier for the state 706;

a color 708 as may be displayed, for example, by an illuminated light on a display panel or other indicator; and

an indication as to whether an acknowledgement (ACK) is required for this state 710.

The portion 702 of this example includes the transition functions (beginning with & in this example) used in determining state transitions from one row of the table to another. Other embodiments of 700 may include information other than as described herein. State transitions occur as a result of evaluating transition functions. It should be noted that if a column name contains a space character (such as between RANGE and LEVEL in 712), the transition function name ends at the space character such that the remaining text (LEVEL) following the transition function (RANGE) is a modifier to the function, telling the function how to interpret the values in the column.

The alarm system determines the new state for an alarm by repeatedly evaluating transition functions in the context of the alarm state table for the alarm. The transition functions are evaluated in the order in which they appear from left to right as columns in the state table. Each transition function may take optional input parameters such as, for example, the current alarm state, values from the alarm state table and other optional values as arguments. As an output, the transition function returns a new state in accordance with the input parameters. Each function is evaluated repeatedly, until the new state returned is the same as indicated by the state value in 704 for a particular row, or until a transition that would loop is detected. Evaluation then proceeds to the transition function indicated by the next column moving from left to right.

In this example 700, two functions are illustrated, &ACK and &RANGE as described in more detail in following paragraphs. The &RANGE function takes an alarm limit vector like the one illustrated in FIG. 11, lower left corner 720, as an example, as well as the following and other possible suffixes in column names:

level—The level number corresponding to the current alarm state;

<—What new state to move to when the current value for the metric is in a range of values lower than the range for the current alarm state;

=—What new state to move to when the current value for the metric is in the range of values associated with the current alarm state; and

>—What new state to move to when the current value for the metric is in a range of values higher than the range associated with the current alarm state.

The alarm limit vector 720 in this example contains an integer (2 in this example) as the first vector element indicating what level number to assign to the highest range of values the metric can assume. The highest range of values includes all values greater than or equal to the limit specified as the second vector element, which is 200 in this example. The third vector element specifies another next-lower limit, which is 100 in this example. The fourth vector element specifies the final next-lower limit, which is 0 in this example. The three ranges or partitions of values are specified using the vector elements 2-4 for the following:

Range 0: values less than 100

Range 1: values from 100 to 199

Range 2: value 200 or more.

In connection with the foregoing, it should be noted that a fourth range is implied for values less than zero (0). In this example, values in this fourth implied range correspond to an error state and are not further described in connection with this example.
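The mapping from a metric value to a range number using the alarm limit vector may be illustrated with the following short Python sketch. The function name `classify` and the use of `None` for the implied error range are assumptions made for illustration.

```python
def classify(value, limits):
    """Return the range number for value, or None for the implied error range.

    limits[0] is the level number of the highest range; limits[1:] are the
    lower bounds of successively lower ranges, e.g. [2, 200, 100, 0].
    """
    top_level = limits[0]
    for offset, lower_bound in enumerate(limits[1:]):
        if value >= lower_bound:
            return top_level - offset
    return None  # below the final limit: the implied error range


LIMITS = [2, 200, 100, 0]  # the vector of this example
```

With this vector, 250 and 200 fall in range 2, 199 and 100 fall in range 1, 99 and 0 fall in range 0, and a negative value falls in the implied error range.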

In this example, the &ACK function is a transition function that returns the current state (as indicated by column 704) when the alarm has not yet been acknowledged, otherwise returns the new state as indicated in column 714 when the alarm has been acknowledged.

Referring back to FIG. 9A, the current alarm state 628 a and the ack state 628 b may be used in determining the current state of the alarm and its corresponding row in the alarm state table. The sequence number 628 c may be used in connection with race conditions that may be associated with alarm conditions and acknowledgments thereof. A unique sequence number is associated with each alarm state that caused a notification message to be sent to a user. Each such message contains a copy of that sequence number for the state causing the message to be sent. In one embodiment, a unique sequence number may be generated for each alarm condition associated with a single invocation of the alarm function. The single invocation of the alarm function may result in transitioning from a current input state to an output state, and may also transition through one or more other intermediate states to arrive at the output state. In this embodiment, a unique sequence number is not associated with the one or more intermediate states. Rather, a first unique sequence number is associated with the current input state and a second unique sequence number is associated with the output state. For example, a first alarm condition notification message having a first sequence number may be sent to a user and indicated on a user interface display. A second alarm condition, indicating a greater severity than the first alarm condition and having a second sequence number, may be determined during the same time period in which a user's acknowledgement of only the first alarm condition is received and processed. The user's acknowledgment is processed in accordance with the particular sequence number associated with the alarm condition being acknowledged. In this example, the acknowledgement indicates an acknowledgement of the first alarm condition, but not the second. Without the use of the sequence number, or some other equivalent known to those of ordinary skill in the art, the acknowledgment may also result in acknowledgement of the second alarm condition depending on whether the acknowledgement is processed before or after the second alarm condition is determined.
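The use of the sequence number to resolve the foregoing race may be sketched as follows. This is a simplified Python illustration in which the class `Alarm` and its method names are hypothetical; it assigns one sequence number per notified state and ignores any acknowledgment carrying a stale number.

```python
import itertools


class Alarm:
    """Tracks the current alarm state, its sequence number and its ack status."""

    def __init__(self):
        self._numbers = itertools.count(1)
        self.state = "normal"
        self.sequence = next(self._numbers)
        self.acked = True  # nothing to acknowledge in the normal state

    def raise_to(self, new_state):
        """Enter a new state; return the sequence number for its notification."""
        self.state = new_state
        self.sequence = next(self._numbers)
        self.acked = False
        return self.sequence

    def acknowledge(self, sequence):
        """Apply an acknowledgment only if it refers to the current state."""
        if sequence != self.sequence:
            return False  # stale: it refers to an earlier alarm condition
        self.acked = True
        return True
```

If a warning is raised, then an alert, and the user's acknowledgment of the warning arrives afterwards, `acknowledge` returns False and the alert remains unacknowledged.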

Referring now to FIG. 12, shown is a state transition diagram 800 illustrating the state transitions associated with the functions 702 of alarm state table 700. In the example 800, each state is represented as a node. The arrows between the nodes represent the transitions possible between the different states in accordance with the information in 702 of table 700. Note that the diagram does not indicate transitions causing a same state or a transition to a state A from a same state A.

What will now be described is an example of acknowledging an alarm illustrating the use of the table 700 in determining a new state from a current state. In an example, the alarm function is invoked including parameters indicating that the current state or initial state is Alert Unacked (6) with a metric value of 250. A human user or other agent acknowledges the alarm, causing the alarm state to be re-evaluated. Examining the row of table 700 for state 6, the &ACK function is evaluated with a state (6) as the current state. Since the alarm is now acknowledged, &ACK returns (5) as indicated in the &ACK column of row 6 of the table as the new state. As a result, the new state of the alarm becomes Alert Acked (5). &ACK is re-evaluated for state 5 using row 5 of the table 700. Since the alarm has been acknowledged, the &ACK function returns a 5 as the new state. Since the new state matches the current state, the evaluation of the &ACK function is complete and evaluation proceeds with the next transition function in state 5. The next transition function is &RANGE. Recall that the metric value for which evaluation is being performed is 250. &RANGE uses the limits vector as described above, and determines that the current metric value of 250 is greater than the first limit of 200, classifying the current metric value as being within the highest range of 2 (200 or greater). The “&RANGE level” column indicates that range (2) corresponds to the current state (5) and so the &RANGE function returns (5), which is the contents of row 5 in the “&RANGE=” column, as the new state. Since the new state (5) is identical to the current state (5), evaluation of the &RANGE function is complete. Since the &RANGE function is the last transition function in the state table in this example, the state change is complete and the alarm system proceeds to carry out whatever actions are configured for the new state and any other notifications or other controls that may be associated with the alarm. For example, the alarm color in the example changes from “Red Flashing” to “Red” as indicated by column 708. Column 708 may be characterized as an associated action to be taken when a transition completes after evaluation of all transition functions. Other embodiments may include other actions.
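The repeated evaluation of transition functions may be replayed with the following Python sketch. The contents of table 700 appear only in FIG. 11, so the rows below are hypothetical: only the entries for states 5 and 6 that the walkthrough mentions (the &ACK targets, the level of 2, the “=” entry of row 5, and the two colors) are taken from the text, and every other state number, name, color and entry is an assumption. The sketch is not a reproduction of FIG. 11 and does not model every behavior of that table.

```python
from collections import namedtuple

# Columns: name, color, &ACK target, &RANGE level, and the <, =, > targets.
Row = namedtuple("Row", "name color ack level lt eq gt")

LIMITS = [2, 200, 100, 0]  # alarm limit vector from the text

# HYPOTHETICAL table; see the assumptions stated in the lead-in.
TABLE = {
    1: Row("Normal", "Green", 1, 0, 1, 1, 4),
    3: Row("Warning Acked", "Yellow", 3, 1, 1, 3, 6),
    4: Row("Warning Unacked", "Yellow Flashing", 3, 1, 4, 4, 6),
    5: Row("Alert Acked", "Red", 5, 2, 4, 5, 5),
    6: Row("Alert Unacked", "Red Flashing", 5, 2, 6, 6, 6),
}


def value_range(value):
    """Classify value using LIMITS; -1 stands for the implied error range."""
    for offset, lower_bound in enumerate(LIMITS[1:]):
        if value >= lower_bound:
            return LIMITS[0] - offset
    return -1


def ack_fn(state, acked, value):
    """&ACK: stay put until acknowledged, then follow the &ACK column."""
    return TABLE[state].ack if acked else state


def range_fn(state, acked, value):
    """&RANGE: compare the value's range with the level of the current row."""
    row, found = TABLE[state], value_range(value)
    if found < row.level:
        return row.lt
    if found > row.level:
        return row.gt
    return row.eq


def evaluate(state, acked, value):
    """Apply each transition function, left to right, until it is stable."""
    for fn in (ack_fn, range_fn):
        seen = {state}
        while True:
            new = fn(state, acked, value)
            if new == state or new in seen:  # stable, or would loop
                break
            seen.add(new)
            state = new
    return state
```

Under these assumptions, `evaluate(6, True, 250)` yields state 5 as in the walkthrough above, `evaluate(1, False, 250)` passes through state 4 to reach state 6, and `evaluate(6, False, 50)` stays latched in state 6.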

Similar examples can be constructed to demonstrate that the state table illustrated in FIG. 11 embodies the following behavior for a latched alert or alarm:

When a metric value enters a value range associated with a “higher” or “more serious” alert state than the current state, the current state transitions to that higher alert state, unacknowledged.

If the metric enters a value range associated with a “lower” or “less serious” alert state, and the alarm has not been acknowledged, no state change takes place—the alarm is said to “latch” at the highest alert state represented by the metric while the alarm is unacknowledged.

If the alarm is acknowledged, the state changes to whatever state currently reflects the value of the alarm metric.

The state table in FIG. 11 illustrates an alarm that, if a metric associated with the alarm assumes a value corresponding to a lower alarm state while the alarm is latched at a higher state, and the alarm is acknowledged, the alarm transitions into the lower alarm state with an unacknowledged status. If the alarm is in a high-severity acknowledged state, and the underlying metric changes to reflect a lower-severity state, the alarm also changes to a lower-severity unacknowledged state. Comparable alarm tables can be constructed that preserve the latter behavior of the alarm while altering the former behavior to transition into an acknowledged state, rather than an unacknowledged state. Such transition tables however, may be characterized as more complex than the example described herein and may include additional states than as illustrated in FIG. 11. The table 700 of FIG. 11 models an “analog” or “floating point” metric. Comparable state tables can be constructed for digital metrics, boolean and other kinds of metrics.

It should be noted that an embodiment of the alarm state table may utilize values to implement hysteresis for one or more of the ranges. Hysteresis may be characterized as a behavior in which a metric value may transition from a first state into a second state in accordance with a first threshold value, and transition from the second state to the first state in accordance with a second threshold value that differs from the first threshold value. Such threshold values may be used in connection with a metric that may assume values at different points in time which hover near a boundary or range threshold. The two different thresholds may be used in order to reduce constantly changing states for metric values hovering near a boundary condition. For example, a range threshold may be 100. It may be that the associated metric assumes a series of values: 98, 99, 101, 99, 101, 99, 102, etc. at consecutive points in time. The range threshold may be used to cause a transition from a first state to a second state (when the value changes from 99 to 101). However, it may be undesirable to have an alarm state change associated with changes from 101 to 99, especially if the metric value may hover around the boundary of 100. An embodiment may determine that once the first threshold is reached causing a transition from a first range under 100 to a second range of 100 or more, a value of 95 or less is needed to cause a transition from the second back to the first range using 95 as the second threshold value. As will be appreciated by those of ordinary skill in the art, state tables as described herein may be modified to include the foregoing use of multiple thresholds to account for hysteresis conditions.
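The two-threshold behavior may be illustrated with the following minimal Python sketch, using the thresholds of 100 and 95 from the foregoing example. The function name `track` and the state labels “low” and “high” are assumptions made for illustration.

```python
def track(values, rise=100, fall=95, state="low"):
    """Return the state after each value, applying separate up/down thresholds."""
    states = []
    for value in values:
        if state == "low" and value >= rise:
            state = "high"   # first threshold: enter the second range
        elif state == "high" and value <= fall:
            state = "low"    # second threshold: return to the first range
        states.append(state)
    return states
```

For the series 98, 99, 101, 99, 101, 99, 102, the state changes once, at the first 101, and then remains “high”; it returns to “low” only when a value of 95 or less is observed.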

The foregoing is one example of an embodiment of how data may be managed within the system and how alarm conditions may be determined and generated. Other embodiments may store the data in one or more other types of data storage in accordance with one or more organizations and determine alarm conditions using techniques other than as described herein.

The XML messages or documents sent by the agents to the receiver may include data corresponding to objects and the tree-like structure as illustrated in FIG. 9. The XML document may include data represented as:

host name: name of host sending the report

category name: name of a group of metrics

metric name: name of a metric

value: metric value

attr1: other attribute/value

. . .
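As a purely illustrative sketch, a report carrying the foregoing fields might be built and read back in Python as follows. The element and attribute names (`report`, `category`, `metric`, `value`, `attr`) are hypothetical, since the actual XML schema used by the agents is not given herein.

```python
import xml.etree.ElementTree as ET


def build_report(host, category, metric, value, **attrs):
    """Serialize one metric report; all tag names are assumed, not specified."""
    report = ET.Element("report", {"host": host})
    cat = ET.SubElement(report, "category", {"name": category})
    met = ET.SubElement(cat, "metric", {"name": metric})
    ET.SubElement(met, "value").text = str(value)
    for name, attr_value in attrs.items():
        ET.SubElement(met, "attr", {"name": name}).text = str(attr_value)
    return ET.tostring(report, encoding="unicode")


def parse_report(xml_text):
    """Return (host, category, metric, value) tuples found in a report."""
    report = ET.fromstring(xml_text)
    rows = []
    for cat in report.findall("category"):
        for met in cat.findall("metric"):
            rows.append((report.get("host"), cat.get("name"),
                         met.get("name"), met.findtext("value")))
    return rows
```

The nesting of host, category and metric mirrors the tree-like structure of the objects, so a receiver can map each parsed tuple to a path in that tree.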

It should be noted that an embodiment may use other formats, such as an alternative to XML, and protocols than as described herein for communications between agents and other components.

Attributes that may be associated with a metric include “value” and “units.” Other kinds of metrics have other attributes. For example, the “Operating System Type” metric may have corresponding attributes unique to every platform. The attributes may include, for example, a version number, machine ID or other information that may be useful on that platform, but not exist on any other platform.

Other groups of metrics may share some additional “standard” or common attributes, for example:

log—used by the log agent 306 for all metrics derived from log files. The “log” attribute is a table of strings containing the complete text of the first three log messages of the type counted by the metric in the reporting interval. These log messages may provide additional detail, for example, to a network administrator who is trying to understand why an alert was raised on a particular metric; and

summary—used by all agents that support the enhanced e-mail or enhanced reporting feature as described elsewhere herein. The “summary” attribute contains a human-readable summary of the “next level of detail” for the metric.

As described herein, data from the RTAP database may be combined using the calculation engine and elements from the tree-structure object oriented database to produce one or more inputs to the threat thermostat controller 218 previously described herein. The calculation engine as described above may process a data-flow language with a syntax and operation similar to a spreadsheet. A calculation engine expression may be defined as expressions attached to data attributes of objects in the database. When the calculation engine processes an object, all of the expressions in the object are evaluated and become the new values of the attributes to which they are attached. This evaluation process is repeated to re-evaluate all objects dependent on changed values.

Expressions in an embodiment may reference other attributes using a relative pathname syntax based on the Apple Macintosh filesystem naming conventions. For example,

‘^’ means the parent object

‘:’ separates object names

‘.’ separates an object path from an attribute name

In one example, there may be 5 external threat thermostat values communicated by the threat agent 200 and stored in the RTAP database called E1, E2, E3, E4, and E5. An input to 218 may be determined as a weighted average of the foregoing five values. The threat agent 200 may monitor or poll the external data sources and write the threat thermostat level indicators to the five “E” external points. In the RTAP database, these five (5) points may be defined as child objects of a parent object representing the combined weighted average in an expression. The value of this expression may be assigned to the parent object having the following expression:

([E1.indicator] + 3 * [E2.indicator] + [E3.indicator] + [E4.indicator] + [E5.indicator]) / 7

In the foregoing, the “.indicator” operator obtains the value of the identified attribute referenced. In the foregoing, external indicator 2 is determined to be three times as valuable or relevant as the other indicators. Whenever an indicator is updated, the calculation engine calculates the tree of other points having expressions referencing the contents of any of the attributes of a changed point. The engine is executed to evaluate the expressions on all of those objects, similar to that of a spreadsheet. An embodiment may use any one or more known techniques in evaluating the expressions in an optimal order.
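The weighted average of the foregoing expression may be written out as the following short Python sketch. The names `WEIGHTS` and `combined_threat` are illustrative and are not part of the calculation engine. The divisor of 7 is the sum of the weights (1 + 3 + 1 + 1 + 1).

```python
# Weight of each external point; E2 counts three times as much as the others.
WEIGHTS = {"E1": 1, "E2": 3, "E3": 1, "E4": 1, "E5": 1}


def combined_threat(indicators):
    """Weighted average of the five external threat indicator values."""
    total = sum(weight * indicators[name] for name, weight in WEIGHTS.items())
    return total / sum(WEIGHTS.values())
```

For example, with E2 at 5 and the other four indicators at 1, the combined value is (1 + 15 + 1 + 1 + 1) / 7 = 19/7, or approximately 2.71.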

In one embodiment, an approach may be taken with respect to combining inputs with respect to the different metrics as may be reported by the different agents. A resulting combination may be expressed as a derived parameter used, for example, in connection with generating an alarm or alert condition, as an input to the threat thermostat 218, and the like. A derived value or signal indicating an attack may be produced by examining one or more of the following: one or more metrics from the NIDS agent, the ARPWatch agent, IPS, the number of bad logins and root user count exceeding some predetermined normal threshold tuned for that particular system. Once an initial set of one or more of the foregoing indicate an alert condition indicative of an attack or attempted attack, secondary condition(s) may be used as confirmation of the attack indication. The secondary information may include a resource usage alert on some machine that occurs simultaneous with, or very shortly after, a reliable indication of an attack on that machine, or on the network at large using the initial set of conditions. An embodiment may, for example, generate an alarm condition or produce an input to the threat thermostat 218 based on the foregoing. An alarm condition may be generated in connection with a yes or true attack indicator value based on the first set of one or more conditions. Once this has occurred, another subsequent alert or alarm condition may also be generated based on the occurrence of one or more of the second set of conditions occurring with the first set of conditions, or within a predetermined time interval thereof, for the network, or one or more of the same computers.

In connection with the foregoing, resource usage metrics may not be used as first level attack indicators or used without also examining other indicators. Resource usage may be characterized as a symptom of potential machine compromise rather than an attack on the machine. Usage metrics may be “noisy” causing false positive indicators of an attack if examined in isolation. However, if such an alert occurs in connection with resource usage simultaneously with, or very shortly after, another reliable “attack” indicator (as in the initial or first set of indicators above), the resource usage metric's credibility increases. Accordingly, the resource usage metrics may be consulted in combination with, or after the occurrence of, other attack indicators to determine, for example, if any one or more particular computers have been compromised. Based on the foregoing, an embodiment may define a derived parameter using an equation or formula that takes into account security metrics and combines them with one or more resource metrics. Such a derived parameter may be used, for example, in connection with producing an input to the threat thermostat controller. Such a derived parameter may be produced using the weighting technique described above.
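One way to combine first level attack indicators with resource usage alerts may be sketched in Python as follows. The function name, the tuple layout and the 60 second window are assumptions chosen for illustration; the text leaves the predetermined time interval to each embodiment. In this sketch a resource usage alert counts only when it occurs on the same host at, or shortly after, a first level attack indicator.

```python
def confirmed_attacks(attack_alerts, resource_alerts, window=60):
    """Pair each attack alert with resource alerts that corroborate it.

    Both inputs are lists of (host, time_in_seconds) tuples. A resource
    alert corroborates an attack alert when it is on the same host and
    occurs at the same time or within `window` seconds afterwards.
    """
    confirmed = []
    for host, attack_time in attack_alerts:
        for resource_host, resource_time in resource_alerts:
            delay = resource_time - attack_time
            if resource_host == host and 0 <= delay <= window:
                confirmed.append((host, attack_time, resource_time))
    return confirmed
```

A resource usage alert with no preceding attack indicator produces no entry, which reflects the treatment of usage metrics as too noisy to act as first level indicators.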

An embodiment may include a form of enhanced reporting or notification as made by the Watch server to a user upon the detection of an alarm condition. As described herein, metrics and associated information may be reported by an agent. The values of the metrics may be one or more factors used in determining an alarm condition. The values of the metrics used to detect and report the occurrence of an alarm condition may be characterized as a first level of alarm or event notification information. Once an alarm condition has been detected, additional information that may have been gathered by the agent may also be useful in proceeding to take a corrective action or further diagnosing a problem associated with the alarm condition. This additional information that may be used in further diagnosing the problem or taking corrective action may be characterized as a second level of information.

An embodiment may provide an option for enabling/disabling notifications of alarm conditions to include this additional information. The additional information may be reported by the agents to the Watch server for optional incorporation into notification messages. An enable/disable option may also be associated with agents gathering the data. Whether an embodiment includes this feature or uses it for selective metrics may vary with each embodiment and its resource limits such as, for example, of the industrial network. Thus, the cost and feasibility of obtaining the second level of information may be balanced with the benefits to be gained in taking corrective actions using the second level of information from an alert message rather than obtaining the information some other way. For example, if a second level of information is not included in an alert notification, an administrator may use a user interface connected to the Watch server 50 to gain the additional information.

It should be noted that the particular additional information included in the second level of enhanced notification may vary with each metric. It may include additional information to assist in determining a problem source such as a user account, IP or other address, particular component(s) or process(es) consuming a resource and the associated percentage(s), and the like. Some of this second level of information is described above with associated metrics. In one embodiment, the enhanced notification feature may be optionally enabled for use with one or more metrics of the SNMP Guard agent 203 and the Guard log agent 209 described herein. For example, the agent 203 may report the following additional information with each of the metrics for enabled enhanced reporting:

Communications status—In the event there is a problem indicated by this metric, corrective action includes determining the reason for the communication failure. A message such as the component is not responding, its IP address and the time of the failure may help.

Login failures—Identify the most frequent type of connection, such as a VPN, remote dial-in, or other connection, a percentage of the failures on this connection, one or more of the user IDs associated with the top number of failures and the percentage and originating IP address or other communication channel associated with each user ID.

Administrative user count, dialup user count, VPN user count—Identify the IP or other addresses from which the most recent administrative users have logged on.

Memory usage, CPU usage, disk space, other resource usage—The top consumers of the resource are identified along with an associated percentage along with which process or user or other information to identify the consumer as appropriate for the resource.

Open session count—Identifies the number of open communication sessions between any two points. Additional information may include two or more IP addresses, host names, and the like, identified as being included as a connection endpoint.

Agent 209 may include the following additional enhanced reporting or notification information for the following metrics:

Configuration—In the event that there has been a change to the configuration as monitored by 209, it may be helpful to know what changed such as whether there is a change to the threat thermostat rule sets, such as included in 220, and/or the current firewall configuration, and what portion of the rules changed. Additionally, for any such change, what was the state change (previous state to what current state), and what user account, address, process and the like, made this change.

Threat thermostat change—An embodiment may indicate an alarm condition when a change occurs to the threat thermostat setting. The change may be the result of a manual change or an automated change in accordance with the functionality included in an embodiment. Additional detail for enhanced reporting may include what user made the change, what was the status changed to/from, the frequency that such changes have been made within a reporting period, the users that most frequently changed the setting and what percentage of the time each user changed the setting.

NIDS and IPS reports—An address or other identifying source of the most frequent alerted NIDS/IPS conditions, an associated percentage of these conditions attributed to a particular source, information about the type of attack, and the target of the attack (what machine by host name, IP address and the like).

Antivirus events—The metric may identify a total number of antivirus events. Additional information may include a break down by type of event within a reporting period to identify what viruses (virus signatures) have been removed from communication streams with an associated frequency or percentage, what source and/or destinations (such as source and destination networks) appeared most frequently for each type, and a frequency or percentage associated with each of the source and destinations.

Other activity—This metric identifies other activity that does not belong in any other category. Additional information may include the text of the first one or more messages of this type detected.

Referring now to FIG. 13, shown is an example 900 of a graphical user interface display. The example 900 may be displayed, for example, using a web browser to view alarm incident reports resulting from notification messages sent in accordance with alarm conditions determined. The example 900 may be used to view and acknowledge one or more of the alarm conditions in an embodiment. In one embodiment, this display of 900 may be viewed when tab 902 for incidents is selected from one of multiple user interface tabs. Each tab may be selected in connection with different viewing and/or processing. In one embodiment, the tab 902 may flash if a new alarm condition is detected from that which is displayed in 900 at a point in time to a user. In other words, this embodiment may not automatically update the display 900 with additional information for alert conditions detected since the user selected tab 902. Rather, this embodiment flashes coloring on tab 902 to indicate such additional alert conditions detected while the user is in the process of using an instance of display 900. The inventors believe that the flashing tab is less disruptive of user concentration during alarm burst conditions than other notification techniques such as, for example, redrawing the display 900 with updated alarm information as available.

The display 900 may indicate in column 906 (labeled “A”) whether a particular condition indicated by a line of displayed data has been acknowledged. An incident or alarm condition associated with a line of displayed data in 900 may be acknowledged, as by selecting the exclamation point icon to the left of a particular line of data, selecting the option 908 to acknowledge all displayed incidents, or some other option that may be provided in an embodiment. The status in 906 for each incident may be updated in accordance with user acknowledgement. For example, 904 indicates that the associated incident has not been acknowledged (e.g., exclamation point notation in column 906). The two incidents as indicated by 910 have been acknowledged (e.g., no exclamation point notation in column 906).

Referring now to FIG. 14, shown is an example of a user interface display 1000. The example 1000 may be displayed when the monitor tab 1020 is selected to view a metric tree. With reference to FIG. 9, the information displayed in 1000 is that information included in a portion of 600—the subtree formed with the security object as its root including all child nodes. The display 1000 shows an aggregate view of the different metrics and associated alarm conditions. The display 1000 reflects the hierarchical representation in the subtree by showing a nesting of hosts (Guard and Watch), categories for each host (such as Intrusion attempts, Resource Usage, and the like), and metrics (such as CPU Usage, Memory Usage and Sessions) associated within each category (such as Resource Usage). In the subtree, these metrics may be defined as leaf nodes having a parent node (category name) defined as Resource Usage.

Associated with each of the metrics is a level indicator. The level indicator may indicate a color or other designation associated uniquely with each alarm state within an embodiment. For example, in one embodiment, the indicator may be green when the metric level is in the normal range, yellow when the metric level is in the warning range, and red when in the highest severity range.

The elements in 1000 representing the parent node of one or more other nodes may have a color or other designation corresponding to the aggregate condition of all the child nodes. For example, the indicator for Resource Usage may represent an aggregate of each of the indicators associated with metrics for CPU Usage, Memory Usage, and Sessions. In one embodiment, the aggregate indicator of a parent node may be determined to be the maximum indicator value of its one or more child nodes. For example, a parent node indicator, such as 1006, is yellow if any one or more of its child node indicators, such as 1008, are yellow but none are red.
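
The aggregation rule described above can be illustrated with a minimal sketch. The names `Level` and `MetricNode`, the numeric ordering of the colors, and the default of green for a leaf with no reported level are assumptions made for illustration only and are not taken from the embodiment described herein.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional


class Level(IntEnum):
    """Hypothetical alarm levels, ordered so that max() picks the most severe."""
    GREEN = 0   # metric level in the normal range
    YELLOW = 1  # metric level in the warning range
    RED = 2     # metric level in the highest severity range


@dataclass
class MetricNode:
    """A node of the metric tree: a host, a category, or a leaf metric."""
    name: str
    level: Optional[Level] = None  # only meaningful for leaf metrics
    children: List["MetricNode"] = field(default_factory=list)

    def indicator(self) -> Level:
        """Return the node's own level for a leaf, else the maximum over its children."""
        if not self.children:
            # Assumption: a leaf with no reported level is shown as normal.
            return self.level if self.level is not None else Level.GREEN
        return max(child.indicator() for child in self.children)


# Example subtree: one host with the Resource Usage category and its three metrics.
resource_usage = MetricNode("Resource Usage", children=[
    MetricNode("CPU Usage", Level.GREEN),
    MetricNode("Memory Usage", Level.YELLOW),
    MetricNode("Sessions", Level.GREEN),
])
watch = MetricNode("Watch", children=[resource_usage])
```

In this sketch, `resource_usage.indicator()` evaluates to `Level.YELLOW` because one child is yellow and none is red, and the same value propagates up to the `watch` host node.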

In 1000, the user may select to view a graph of a particular metric in the right portion of 1000 by selecting an option in the left portion of 1000 as indicated by selection 1010. In one embodiment, it should be noted that the graph portion is not immediately updated in response to a user selection. Rather, the graph may be updated when the web page is refreshed in accordance with the one or more selections made at that point in time. Note that in 1000 the icon or indicator displayed for Watch 1002 has a different shape than other machines, such as the Guard machine or host. The different shape makes it easier for users to find the Watch server in the list of hosts since many of the important network monitoring metrics are found in the Watch server branch of the metric tree.

The foregoing 900 and 1000 are examples of user interface displays that may be included in an embodiment and displayed, such as using a web browser on the web server 214 of FIG. 4. Other embodiments may use other interface displays than those described herein.

While the invention has been disclosed in connection with preferred embodiments shown and described in detail, their modifications and improvements thereon will become readily apparent to those skilled in the art. Accordingly, the spirit and scope of the present invention should be limited only by the following claims.

1. A method for controlling connectivity in a network comprising: receiving one or more inputs; determining a threat level indicator in accordance with said one or more inputs; and selecting, for use in said network, a firewall configuration in accordance with said threat level indicator.
2. The method of claim 1, wherein said firewall configuration is selected from a plurality of firewall configurations each associated with a different threat level indicator.
3. The method of claim 2, wherein a first firewall configuration associated with a first threat level indicator provides for more restrictive connectivity of said network than a second firewall configuration associated with a second threat level indicator when said first threat level indicator is a higher threat level than said second threat level indicator.
4. The method of claim 3, wherein, a firewall configuration associated with a highest threat level indicator provides for disconnecting said network from all other less-trusted networks.
5. The method of claim 4, wherein said disconnecting includes physically disconnecting said network from other networks.
6. The method of claim 4, wherein said network is reconnected to said less trusted networks when a current threat level is a level other than said highest threat level indicator.
7. The method of claim 1, further comprising: automatically loading said firewall configuration as a current firewall configuration in use in said network.
8. The method of claim 1, wherein said one or more inputs includes at least one of: a manual input, a metric about a system in said network, a metric about said network, a derived value determined using a plurality of weighted metrics including one metric about said network, a derived value determined using a plurality of metrics, and an external source from said network.
9. The method of claim 8, wherein, if said manual input is specified, said manual input determines the threat level indicator overriding all other indicators.
10. The method of claim 8, wherein said plurality of weighted metrics includes a metric about at least one of: a network intrusion detection, a network intrusion prevention, a number of failed login attempts, a number of users with a high level of privileges.
11. The method of claim 10, wherein said high level of privileges corresponds to one of: administrator privileges and root user privileges.
12. The method of claim 1, wherein said selecting additionally selects one or more of the following: an antivirus configuration, an intrusion prevention configuration, and an intrusion detection configuration.
13. A computer program product for controlling connectivity in a network comprising code that: receives one or more inputs; determines a threat level indicator in accordance with said one or more inputs; and selects, for use in said network, a firewall configuration in accordance with said threat level indicator.
14. The computer program product of claim 13, wherein said firewall configuration is selected from a plurality of firewall configurations each associated with a different threat level indicator.
15. The computer program product of claim 14, wherein a first firewall configuration associated with a first threat level indicator provides for more restrictive connectivity of said network than a second firewall configuration associated with a second threat level indicator when said first threat level indicator is a higher threat level than said second threat level indicator.
16. The computer program product of claim 15, wherein, a firewall configuration associated with a highest threat level indicator provides for disconnecting said network from all other less-trusted networks.
17. The computer program product of claim 16, wherein said code that disconnects includes physically disconnecting said network from other networks.
18. The computer program product of claim 16, wherein said network is reconnected to said less trusted networks when a current threat level is a level other than said highest threat level indicator.
19. The computer program product of claim 13, further comprising code that: automatically loads said firewall configuration as a current firewall configuration in use in said network.
20. The computer program product of claim 13, wherein said one or more inputs includes at least one of: a manual input, a metric about a system in said network, a metric about said network, a derived value determined using a plurality of weighted metrics including one metric about said network, a derived value determined using a plurality of metrics, and an external source from said network.
21. The computer program product of claim 20, wherein, if said manual input is specified, said manual input determines the threat level indicator overriding all other indicators.
22. The computer program product of claim 20, wherein said plurality of weighted metrics includes a metric about at least one of: a network intrusion detection, a network intrusion prevention, a number of failed login attempts, a number of users with a high level of privileges.
23. The computer program product of claim 22, wherein said high level of privileges corresponds to one of: administrator privileges and root user privileges.
24. The computer program product of claim 13, wherein said code that selects additionally selects one or more of the following: an antivirus configuration, an intrusion prevention configuration, and an intrusion detection configuration.
25-174. (canceled)
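
By way of illustration only, the selection recited in claims 1, 2, 8 and 9 might be sketched as follows. The function names, metric names, weights, thresholds, and configuration file names are all hypothetical and are not part of the claims or of any described embodiment; the sketch assumes three threat levels and a derived value computed as a weighted sum of metrics.

```python
from typing import Dict, Optional

# Hypothetical mapping of threat level to firewall configuration,
# ordered from least restrictive (1) to most restrictive (3).
FIREWALL_CONFIGS: Dict[int, str] = {
    1: "normal.rules",
    2: "restricted.rules",
    3: "disconnect.rules",
}

# Hypothetical minimum derived value required to reach each elevated level.
THRESHOLDS: Dict[int, float] = {2: 10.0, 3: 25.0}


def derived_value(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the reported metrics; metrics without a weight count as zero."""
    return sum(value * weights.get(name, 0.0) for name, value in metrics.items())


def threat_level(metrics: Dict[str, float],
                 weights: Dict[str, float],
                 manual: Optional[int] = None) -> int:
    """Determine the threat level; a manual input, if given, overrides all other inputs."""
    if manual is not None:
        return manual
    score = derived_value(metrics, weights)
    level = 1
    for candidate, threshold in sorted(THRESHOLDS.items()):
        if score >= threshold:
            level = candidate
    return level


def select_firewall_config(level: int) -> str:
    """Select the firewall configuration associated with the given threat level."""
    return FIREWALL_CONFIGS[level]


# Example inputs (illustrative values only).
weights = {"failed_logins": 2.0, "ids_alerts": 5.0}
metrics = {"failed_logins": 4, "ids_alerts": 1}
level = threat_level(metrics, weights)
config = select_firewall_config(level)
```

With these illustrative inputs the derived value is 4 × 2.0 + 1 × 5.0 = 13.0, which meets the threshold for level 2 but not for level 3, so the restricted configuration is selected. Passing `manual=3` would select the most restrictive configuration regardless of the metrics.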